toxigen task measures toxicity classification rather than whether generations are toxic? #974

laphang · 2023-11-08T04:32:21Z

I would like to evaluate the toxicity of the generations of my fine tuned models. I'm interested in using toxigen which seems popular (eg used in the Llama 2 paper).

However, looking at the current toxigen task, it seems to measure how well the LLM performs as a classifier when using the prompt in doc_to_tex()? (is my understanding of that correct?)
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/toxigen.py

However, here's the relevant section (Appendix A.4.7) from the Llama 2 paper (which follows on from Section 4.1)
https://arxiv.org/pdf/2307.09288.pdf

Toxicity. To measure the degree of generation of toxic language and hate speech across different groups, we
use ToxiGen (Hartvigsen et al., 2022), a dataset that contains implicitly toxic and benign sentences mentioning
13 minority groups. We adopt a revised version of the dataset from Hosseini et al. (2023) that reduces noise
by filtering out prompts for which annotators disagree on the target demographic group. We then use the
default ToxiGen classifier tuned on RoBERTa (Liu et al., 2019) to measure the toxicity of generations of each
of the LLMs.

They generate text from the LLM using the prompts from (a revised version of) the toxigen dataset, and then use a roberta classifier trained on the toxigen dataset to evaluate whether the generated outputs are toxic.

eg. the lead author of the toxigen paper has uploaded this toxigen roberta classifier
https://huggingface.co/tomh/toxigen_roberta

Assuming my understanding of the way the toxigen task in eval harness is currently working is correct, I was wondering if the task could be modified (or a new task created) which used a toxigen roberta classifier to measure whether the LLM generations are toxic?
Alternatively, are there any other tasks that are useful in terms of measuring the toxicity of LLM generations?

baberabb · 2023-11-09T04:07:36Z

Hi! There's Real Toxicity in the big-refactor (soon to be main) branch which evaluates the generations with the Perspective API (need a key but it's free) using a custom metric.py. Think you should also be able to modify this to pass the model outputs to an arbitrary classifier rather than the API and/or also use a different dataset in the task yaml.

laphang · 2023-11-09T05:34:04Z

Thanks for the response, I'll keen an eye on the big-refactor getting merged into main!

danihinjos · 2024-01-29T18:17:27Z

Hi @laphang, did you come up with code to implement this? I'm exactly in the same position at the moment. Thanks!

laphang · 2024-01-30T00:23:10Z

Hi @danihinjos, no I didn't end up implementing this unfortunately.

I did try the real toxicity task, but found the perspective api too slow.

As an aside, I did gain more confidence in this current method after reading the Orca 2 paper, where they discuss using toxigen both for discriminative evaluation (the classification used in the current lm-eval toxigen task) and generative evaluation. And that models that were good at discriminative evaluation tended to generate less toxic outputs.
https://www.microsoft.com/en-us/research/publication/orca-2-teaching-small-language-models-how-to-reason/

But it would be good to be able to explicitly do the generative evaluation too.

tea-shaped · 2024-04-14T23:34:20Z

Agree, I would also find an implementation of toxigen super useful. Did you find another way outside of lm-evaluation-harness to run toxigen (relatively) easily?

haileyschoelkopf · 2024-04-16T01:51:06Z

@Thartvigsen wondering if you've got an implementation of evaluating causal models' toxicity (as opposed to their use for classifying toxicity which you contributed) using Toxigen, possibly?

Thartvigsen · 2024-05-23T15:00:12Z

@haileyschoelkopf ah sorry I missed this. The ToxiGen dataset just contains sentences w/ binary labels (hate vs. not hate) so I don't think it can be directly used for eval unless the data are restructured into pairs of hateful/non-hateful sentences? The score could be whether a causal model finds the hateful sentences much less likely than non-hateful sentences (and by how much). I don't have that implemented already, but it seems relatively straightforward.

OR if "evaluate causal models' toxicity w/ ToxiGen" means "run a model trained on ToxiGen on the outputs of a causal model"? This is how others evaluate language models w/ ToxiGen. They just use our RoBERTA model or HateBERT model, both of which were finetuned on ToxiGen.

laphang closed this as completed Nov 9, 2023

haileyschoelkopf reopened this Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toxigen task measures toxicity classification rather than whether generations are toxic? #974

toxigen task measures toxicity classification rather than whether generations are toxic? #974

laphang commented Nov 8, 2023 •

edited

baberabb commented Nov 9, 2023

laphang commented Nov 9, 2023

danihinjos commented Jan 29, 2024

laphang commented Jan 30, 2024

tea-shaped commented Apr 14, 2024

haileyschoelkopf commented Apr 16, 2024

Thartvigsen commented May 23, 2024

toxigen task measures toxicity classification rather than whether generations are toxic? #974

toxigen task measures toxicity classification rather than whether generations are toxic? #974

Comments

laphang commented Nov 8, 2023 • edited

baberabb commented Nov 9, 2023

laphang commented Nov 9, 2023

danihinjos commented Jan 29, 2024

laphang commented Jan 30, 2024

tea-shaped commented Apr 14, 2024

haileyschoelkopf commented Apr 16, 2024

Thartvigsen commented May 23, 2024

laphang commented Nov 8, 2023 •

edited