-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
toxigen task measures toxicity classification rather than whether generations are toxic? #974
Comments
Hi! There's Real Toxicity in the big-refactor (soon to be main) branch which evaluates the generations with the Perspective API (need a key but it's free) using a custom |
Thanks for the response, I'll keen an eye on the big-refactor getting merged into main! |
Hi @laphang, did you come up with code to implement this? I'm exactly in the same position at the moment. Thanks! |
Hi @danihinjos, no I didn't end up implementing this unfortunately. I did try the real toxicity task, but found the perspective api too slow. As an aside, I did gain more confidence in this current method after reading the Orca 2 paper, where they discuss using toxigen both for discriminative evaluation (the classification used in the current lm-eval toxigen task) and generative evaluation. And that models that were good at discriminative evaluation tended to generate less toxic outputs. But it would be good to be able to explicitly do the generative evaluation too. |
Agree, I would also find an implementation of toxigen super useful. Did you find another way outside of lm-evaluation-harness to run toxigen (relatively) easily? |
@Thartvigsen wondering if you've got an implementation of evaluating causal models' toxicity (as opposed to their use for classifying toxicity which you contributed) using Toxigen, possibly? |
@haileyschoelkopf ah sorry I missed this. The ToxiGen dataset just contains sentences w/ binary labels (hate vs. not hate) so I don't think it can be directly used for eval unless the data are restructured into pairs of hateful/non-hateful sentences? The score could be whether a causal model finds the hateful sentences much less likely than non-hateful sentences (and by how much). I don't have that implemented already, but it seems relatively straightforward. OR if "evaluate causal models' toxicity w/ ToxiGen" means "run a model trained on ToxiGen on the outputs of a causal model"? This is how others evaluate language models w/ ToxiGen. They just use our RoBERTA model or HateBERT model, both of which were finetuned on ToxiGen. |
I would like to evaluate the toxicity of the generations of my fine tuned models. I'm interested in using toxigen which seems popular (eg used in the Llama 2 paper).
However, looking at the current toxigen task, it seems to measure how well the LLM performs as a classifier when using the prompt in doc_to_tex()? (is my understanding of that correct?)
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/toxigen.py
However, here's the relevant section (Appendix A.4.7) from the Llama 2 paper (which follows on from Section 4.1)
https://arxiv.org/pdf/2307.09288.pdf
They generate text from the LLM using the prompts from (a revised version of) the toxigen dataset, and then use a roberta classifier trained on the toxigen dataset to evaluate whether the generated outputs are toxic.
eg. the lead author of the toxigen paper has uploaded this toxigen roberta classifier
https://huggingface.co/tomh/toxigen_roberta
Assuming my understanding of the way the toxigen task in eval harness is currently working is correct, I was wondering if the task could be modified (or a new task created) which used a toxigen roberta classifier to measure whether the LLM generations are toxic?
Alternatively, are there any other tasks that are useful in terms of measuring the toxicity of LLM generations?
The text was updated successfully, but these errors were encountered: