Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toxigen task measures toxicity classification rather than whether generations are toxic? #974

Open
laphang opened this issue Nov 8, 2023 · 7 comments

Comments

@laphang
Copy link

laphang commented Nov 8, 2023

I would like to evaluate the toxicity of the generations of my fine tuned models. I'm interested in using toxigen which seems popular (eg used in the Llama 2 paper).

However, looking at the current toxigen task, it seems to measure how well the LLM performs as a classifier when using the prompt in doc_to_tex()? (is my understanding of that correct?)
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/toxigen.py

However, here's the relevant section (Appendix A.4.7) from the Llama 2 paper (which follows on from Section 4.1)
https://arxiv.org/pdf/2307.09288.pdf

Toxicity. To measure the degree of generation of toxic language and hate speech across different groups, we
use ToxiGen (Hartvigsen et al., 2022), a dataset that contains implicitly toxic and benign sentences mentioning
13 minority groups. We adopt a revised version of the dataset from Hosseini et al. (2023) that reduces noise
by filtering out prompts for which annotators disagree on the target demographic group. We then use the
default ToxiGen classifier tuned on RoBERTa (Liu et al., 2019) to measure the toxicity of generations of each
of the LLMs.

They generate text from the LLM using the prompts from (a revised version of) the toxigen dataset, and then use a roberta classifier trained on the toxigen dataset to evaluate whether the generated outputs are toxic.

eg. the lead author of the toxigen paper has uploaded this toxigen roberta classifier
https://huggingface.co/tomh/toxigen_roberta

  1. Assuming my understanding of the way the toxigen task in eval harness is currently working is correct, I was wondering if the task could be modified (or a new task created) which used a toxigen roberta classifier to measure whether the LLM generations are toxic?

  2. Alternatively, are there any other tasks that are useful in terms of measuring the toxicity of LLM generations?

@baberabb
Copy link
Contributor

baberabb commented Nov 9, 2023

Hi! There's Real Toxicity in the big-refactor (soon to be main) branch which evaluates the generations with the Perspective API (need a key but it's free) using a custom metric.py. Think you should also be able to modify this to pass the model outputs to an arbitrary classifier rather than the API and/or also use a different dataset in the task yaml.

@laphang
Copy link
Author

laphang commented Nov 9, 2023

Thanks for the response, I'll keen an eye on the big-refactor getting merged into main!

@laphang laphang closed this as completed Nov 9, 2023
@danihinjos
Copy link

Hi @laphang, did you come up with code to implement this? I'm exactly in the same position at the moment. Thanks!

@laphang
Copy link
Author

laphang commented Jan 30, 2024

Hi @danihinjos, no I didn't end up implementing this unfortunately.

I did try the real toxicity task, but found the perspective api too slow.

As an aside, I did gain more confidence in this current method after reading the Orca 2 paper, where they discuss using toxigen both for discriminative evaluation (the classification used in the current lm-eval toxigen task) and generative evaluation. And that models that were good at discriminative evaluation tended to generate less toxic outputs.
https://www.microsoft.com/en-us/research/publication/orca-2-teaching-small-language-models-how-to-reason/

But it would be good to be able to explicitly do the generative evaluation too.

@tea-shaped
Copy link

Agree, I would also find an implementation of toxigen super useful. Did you find another way outside of lm-evaluation-harness to run toxigen (relatively) easily?

@haileyschoelkopf
Copy link
Contributor

@Thartvigsen wondering if you've got an implementation of evaluating causal models' toxicity (as opposed to their use for classifying toxicity which you contributed) using Toxigen, possibly?

@Thartvigsen
Copy link
Contributor

@haileyschoelkopf ah sorry I missed this. The ToxiGen dataset just contains sentences w/ binary labels (hate vs. not hate) so I don't think it can be directly used for eval unless the data are restructured into pairs of hateful/non-hateful sentences? The score could be whether a causal model finds the hateful sentences much less likely than non-hateful sentences (and by how much). I don't have that implemented already, but it seems relatively straightforward.

OR if "evaluate causal models' toxicity w/ ToxiGen" means "run a model trained on ToxiGen on the outputs of a causal model"? This is how others evaluate language models w/ ToxiGen. They just use our RoBERTA model or HateBERT model, both of which were finetuned on ToxiGen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants