Confusion matrix metric #1921

Open · minaremeli wants to merge 6 commits into main
Conversation

@minaremeli

For classification tasks it is often useful to plot the confusion matrix.

Implementation details

Added a new metric called cm to metrics.py, with an aggregation function that calls sklearn's confusion_matrix().
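For context, a minimal sketch of what such an aggregation function could look like (illustrative only; the exact name and signature in the PR may differ):

```python
# Sketch only: an aggregation function that collects (gold, prediction) pairs
# from all evaluated samples and hands them to sklearn.
from sklearn.metrics import confusion_matrix


def cm(items):
    # items is assumed to be an iterable of (gold_label, predicted_label) pairs
    golds, preds = zip(*items)
    return confusion_matrix(golds, preds)
```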

In utils.py, added a new function make_confusion_matrix(), which is called after make_table() in __main__.py.
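And a rough sketch of a helper that renders the aggregated matrix as a markdown table in the style shown below (names and exact formatting are assumptions, not necessarily what the PR implements):

```python
def make_confusion_matrix(matrix, labels=None):
    """Render a confusion matrix (rows = true classes, columns = predicted
    classes) as a markdown table. Sketch only."""
    labels = labels or [f"C{i}" for i in range(len(matrix))]
    header = "|t/p|" + "|".join(labels) + "|"
    separator = "|---|" + "--:|" * len(labels)
    rows = [
        "|" + labels[i] + "|" + "|".join(str(v) for v in row) + "|"
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header, separator] + rows)
```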

Example

For example, by adding cm to the list of evaluated metrics for the sst2 task:
lm_eval/tasks/glue/sst2/default.yaml

```yaml
metric_list:
  - metric: acc
  - metric: cm
```

And executing the following command:

```bash
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks sst2 \
    --batch_size 8
```

We get this output:

hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|sst2 |      1|none  |     0|acc   |↑  |0.5505|±  |0.0169|

Confusion Matrix for task: sst2
|t/p|C0 |C1 |
|---|--:|--:|
|C0 |409| 19|
|C1 |373| 71|

The rows represent the true classes (C0 and C1, which correspond to the negative and positive classes) and the columns represent the predicted classes.
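Since the metric calls sklearn's confusion_matrix(), this follows sklearn's convention: entry (i, j) counts samples whose true class is i and whose predicted class is j. A quick sanity check:

```python
from sklearn.metrics import confusion_matrix

# true labels 0, 0, 1, 1 vs. predictions 0, 1, 0, 0
confusion_matrix([0, 0, 1, 1], [0, 1, 0, 0])
# array([[1, 1],
#        [2, 0]])   # rows = true class, columns = predicted class
```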

@CLAassistant commented Jun 4, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member)

This looks excellent! Which task(s) were you thinking of adding it to? "All multiple choice tasks with two options," or something else?

@minaremeli (Author)

@StellaAthena Thank you! More than two options are supported too. For example, the bhtc_v2 task from basque-glue is a 12-class classification task, for which the output would look something like this:

hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-------|------:|------|-----:|------|---|-----:|---|------|
|bhtc_v2|      1|none  |     0|f1    |↑  |0.0491|±  |N/A   |

Confusion Matrix for task: bhtc_v2
|t/p|C0 |C1 |C2 |C3 |C4 |C5 |C6 |C7 |C8 |C9 |C10|C11|
|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
|C0 |  7|  1|  0|  0|149|  0|  0|  0| 12|  0|  0|  0|
|C1 |  1|  0|  0|  0| 15|  0|  0|  0|  1|  0|  0|  0|
|C2 |  0|  1|  8|  0| 77|  0|  0|  0|  8|  0|  0|  0|
|C3 |  1|  2|  0|  0|501|  0|  0|  0| 31|  0|  0|  0|
|C4 |  0|  0|  0|  0| 59|  0|  0|  0|  2|  0|  0|  0|
|C5 |  2|  0|  0|  0|151|  0|  0|  0| 22|  0|  1|  1|
|C6 |  1|  0|  0|  0| 52|  0|  0|  0|  6|  0|  1|  0|
|C7 |  0|  0|  0|  0| 19|  0|  0|  0|  2|  0|  1|  0|
|C8 |  0|  2|  1|  0|154|  0|  1|  0| 16|  0|  0|  0|
|C9 |  2|  0|  0|  0|238|  0|  0|  0|  8|  0|  1|  0|
|C10|  1|  0|  0|  0|281|  0|  0|  1|  4|  0|  0|  0|
|C11|  0|  0|  0|  0|  7|  0|  0|  0|  1|  0|  0|  1|

It might make sense to add a disclaimer, though, since most multiple-choice tasks are not necessarily classification tasks (most QA-type tasks, for example). Still, it could be useful if people wanted to check for A-bias (the tendency of a model to pick the answer choice labeled "A").

@haileyschoelkopf (Contributor) left a comment

Hi @minaremeli , thanks for the PR! This looks like a very cool contribution.

However, I'm hesitant to have this printed on every run. It might also be nice to be able to apply this to acc_norm as well as acc.

Would you be willing to make this a scripts/confusion_matrix.py utility that can take in the path to a logged per-sample results file as well as a target (multiple-choice) metric, and write the confusion matrix for that task? That way we can promote it as something for a user to use when exploring their results, but not end up printing too much info every time.

@minaremeli (Author)

@haileyschoelkopf Thank you for your feedback! I can see how the current solution might crowd the stdout, especially if multiple tasks are being evaluated. I would be happy to change the PR to fit your suggestions.

I like your idea of moving it to the per-sample results file (or even a separate file?). Do you suggest simply appending the CM to that file? And should cm stay a metric included in the task's config file, or would it make more sense as a command-line argument like --log_samples (maybe --log_cm)?

@haileyschoelkopf (Contributor)

I was suggesting that we have a workflow like

  • Run lm_eval --task xxx --log_samples --output_path ./my-folder to log per-sample results for any multiple-choice task
  • Run python scripts/make_confusion_matrix.py --file ./my-folder/model_name/{samples-for-task-I-want} --metric acc to read in just the logged samples for a task, calculate a confusion matrix (using the indicated metric, if it is a 0-1 binary metric), and print it for that task (see the sketch below)

How does this sound?
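A rough sketch of what such a script could look like, assuming the proposed --file and --metric flags and the sample fields ("resps", "target") visible in the review snippets below; the acc_norm branch and label handling are simplified here:

```python
# Hypothetical sketch of scripts/make_confusion_matrix.py; not the PR's final code.
import argparse
import json

import numpy as np
from sklearn.metrics import confusion_matrix


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", required=True, help="path to a logged samples .jsonl file")
    parser.add_argument("--metric", default="acc", choices=["acc", "acc_norm"])
    args = parser.parse_args()

    targets, predictions = [], []
    with open(args.file) as f:
        for line in f:
            sample = json.loads(line)
            lls = [float(r[0][0]) for r in sample["resps"]]
            # acc_norm would additionally normalize lls by completion length;
            # omitted here to keep the sketch short.
            predictions.append(int(np.argmax(lls)))
            targets.append(sample["target"])  # assumes integer targets

    print(confusion_matrix(targets, predictions))


if __name__ == "__main__":
    main()
```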

@StellaAthena (Member)

@minaremeli hey, I wanted to follow up about this and see what your progress is / if you're stuck on anything.

@minaremeli (Author)

Hey @StellaAthena! No progress yet, I haven't had time to start working on this change until now. Let me get back to you on Friday.

Comment on lines +58 to +66
```python
results = sample["resps"]
lls = [float(r[0][0]) for r in results]
pred = np.argmax(lls)
pred_norm = np.argmax(lls / completion_len)
if metric == "acc":
    pred_label = choices[pred]
elif metric == "acc_norm":
    pred_label = choices[pred_norm]
predictions.append(pred_label)
```
@minaremeli (Author)

The prediction method (acc or acc_norm) is copied based on this. A drawback of this approach is that if the original method changes (e.g. the normalization constant changes from number of characters to number of tokens), that will not be reflected here.

Comment on lines +47 to +56
```python
if isinstance(target, int):
    target_label = choices[target]
    targets.append(target_label)
elif isinstance(target, str):
    assert target in choices
    targets.append(target)
elif isinstance(target, list):
    raise NotImplementedError(
        "No support yet for multi-label confusion matrix!"
    )
```
@minaremeli (Author)

The target can be specified as int (index of the true label), str (the true label itself), or list (multi-label tasks). I decided not to implement the multi-label case.

@minaremeli (Author)

@haileyschoelkopf @StellaAthena I changed my implementation to the proposed script version. I also left some comments on parts that might require your input. What's your suggestion regarding documentation?

@minaremeli (Author)

Also, I noticed that the sample results files have recently changed from json to jsonl. Any future change to how the samples are logged might break this feature; this is something to keep in mind.
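For what it's worth, reading the jsonl format is straightforward (one JSON object per line), so the coupling to the logging format is mostly limited to the field names the script expects. A hypothetical loader:

```python
import json


def load_samples(path):
    # One logged sample (a JSON object) per line.
    with open(path) as f:
        return [json.loads(line) for line in f]


# e.g. (hypothetical path):
# samples = load_samples("./my-folder/model_name/samples_sst2.jsonl")
```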
