Confusion matrix metric #1921

Open · minaremeli wants to merge 6 commits into main
Conversation

@minaremeli

For classification tasks it is often useful to plot the confusion matrix.

Implementation details

Added a new metric called cm to metrics.py, with an aggregation function that calls sklearn's confusion_matrix().
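For context, a minimal sketch of what such an aggregation function could look like (illustrative only; the exact name and signature in the PR may differ):

```python
# Sketch only: an aggregation function that collects (gold, prediction) pairs
# from all evaluated samples and hands them to sklearn.
from sklearn.metrics import confusion_matrix


def cm(items):
    # items is assumed to be an iterable of (gold_label, predicted_label) pairs
    golds, preds = zip(*items)
    return confusion_matrix(golds, preds)
```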

In utils.py, added a new function make_confusion_matrix(), which is called after make_table() in __main__.py.
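And a rough sketch of a helper that renders the aggregated matrix as a markdown table in the style shown below (names and exact formatting are assumptions, not necessarily what the PR implements):

```python
def make_confusion_matrix(matrix, labels=None):
    """Render a confusion matrix (rows = true classes, columns = predicted
    classes) as a markdown table. Sketch only."""
    labels = labels or [f"C{i}" for i in range(len(matrix))]
    header = "|t/p|" + "|".join(labels) + "|"
    separator = "|---|" + "--:|" * len(labels)
    rows = [
        "|" + labels[i] + "|" + "|".join(str(v) for v in row) + "|"
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header, separator] + rows)
```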

Example

For example, by adding cm to the list of evaluated metrics for the sst2 task:
lm_eval/tasks/glue/sst2/default.yaml

```yaml
metric_list:
  - metric: acc
  - metric: cm
```

And executing the following command:

```bash
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks sst2 \
    --batch_size 8
```

We get this output:

hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|sst2 |      1|none  |     0|acc   |↑  |0.5505|±  |0.0169|

Confusion Matrix for task: sst2
|t/p|C0 |C1 |
|---|--:|--:|
|C0 |409| 19|
|C1 |373| 71|

The rows represent the true classes (C0 and C1, which correspond to the negative and positive classes) and the columns represent the predicted classes.
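Since the metric calls sklearn's confusion_matrix(), this follows sklearn's convention: entry (i, j) counts samples whose true class is i and whose predicted class is j. A quick sanity check:

```python
from sklearn.metrics import confusion_matrix

# true labels 0, 0, 1, 1 vs. predictions 0, 1, 0, 0
confusion_matrix([0, 0, 1, 1], [0, 1, 0, 0])
# array([[1, 1],
#        [2, 0]])   # rows = true class, columns = predicted class
```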

@CLAassistant commented Jun 4, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member)

This looks excellent! Which task(s) were you thinking of adding it to? "All multiple choice tasks with two options," or something else?

@minaremeli (Author)

@StellaAthena Thank you! More than two options are supported too. For example, the bhtc_v2 task from basque-glue is a 12-class classification task, for which the output would look something like this:

hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-------|------:|------|-----:|------|---|-----:|---|------|
|bhtc_v2|      1|none  |     0|f1    |↑  |0.0491|±  |N/A   |

Confusion Matrix for task: bhtc_v2
|t/p|C0 |C1 |C2 |C3 |C4 |C5 |C6 |C7 |C8 |C9 |C10|C11|
|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
|C0 |  7|  1|  0|  0|149|  0|  0|  0| 12|  0|  0|  0|
|C1 |  1|  0|  0|  0| 15|  0|  0|  0|  1|  0|  0|  0|
|C2 |  0|  1|  8|  0| 77|  0|  0|  0|  8|  0|  0|  0|
|C3 |  1|  2|  0|  0|501|  0|  0|  0| 31|  0|  0|  0|
|C4 |  0|  0|  0|  0| 59|  0|  0|  0|  2|  0|  0|  0|
|C5 |  2|  0|  0|  0|151|  0|  0|  0| 22|  0|  1|  1|
|C6 |  1|  0|  0|  0| 52|  0|  0|  0|  6|  0|  1|  0|
|C7 |  0|  0|  0|  0| 19|  0|  0|  0|  2|  0|  1|  0|
|C8 |  0|  2|  1|  0|154|  0|  1|  0| 16|  0|  0|  0|
|C9 |  2|  0|  0|  0|238|  0|  0|  0|  8|  0|  1|  0|
|C10|  1|  0|  0|  0|281|  0|  0|  1|  4|  0|  0|  0|
|C11|  0|  0|  0|  0|  7|  0|  0|  0|  1|  0|  0|  1|

It might make sense to add a disclaimer, though, since most multiple-choice tasks are not necessarily classification tasks (most QA-type tasks, for example). Still, it could be useful if people wanted to check for A-bias (the tendency of a model to pick the answer choice labeled "A").

@haileyschoelkopf (Contributor) left a comment

Hi @minaremeli , thanks for the PR! This looks like a very cool contribution.

However, I'm hesitant to have this printed on every run. It might also be nice to be able to apply this to acc_norm as well as acc.

Would you be willing to make this a scripts/confusion_matrix.py utility that can take in the path to a logged per-sample results file as well as a target (multiple-choice) metric, and write the confusion matrix for that task? That way we can promote it as something for a user to use when exploring their results, but not end up printing too much info every time.

@minaremeli (Author)

@haileyschoelkopf Thank you for your feedback! I can see how the current solution might crowd the stdout, especially if multiple tasks are being evaluated. I would be happy to change the PR to fit your suggestions.

I like your idea of moving it to the per-sample results file (or even a separate file?). Do you suggest simply appending the CM to that file? And should cm stay a metric included in the task's config file, or would it make more sense as a command-line argument like --log_samples (maybe --log_cm)?

@haileyschoelkopf (Contributor)

I was suggesting that we have a workflow like

  • Run lm_eval --task xxx --log_samples --output_path ./my-folder to log per-sample results for any multiple-choice task
  • Run python scripts/make_confusion_matrix.py --file ./my-folder/model_name/{samples-for-task-I-want} --metric acc to read in just the logged samples for a task, calculate a confusion matrix (using the indicated metric, if it is a 0-1 binary metric), and print it for that task (see the sketch below)

How does this sound?
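A rough sketch of what such a script could look like, assuming the proposed --file and --metric flags and the sample fields ("resps", "target") visible in the review snippets below; the acc_norm branch and label handling are simplified here:

```python
# Hypothetical sketch of scripts/make_confusion_matrix.py; not the PR's final code.
import argparse
import json

import numpy as np
from sklearn.metrics import confusion_matrix


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", required=True, help="path to a logged samples .jsonl file")
    parser.add_argument("--metric", default="acc", choices=["acc", "acc_norm"])
    args = parser.parse_args()

    targets, predictions = [], []
    with open(args.file) as f:
        for line in f:
            sample = json.loads(line)
            lls = [float(r[0][0]) for r in sample["resps"]]
            # acc_norm would additionally normalize lls by completion length;
            # omitted here to keep the sketch short.
            predictions.append(int(np.argmax(lls)))
            targets.append(sample["target"])  # assumes integer targets

    print(confusion_matrix(targets, predictions))


if __name__ == "__main__":
    main()
```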

@StellaAthena (Member)

@minaremeli hey, I wanted to follow up about this and see what your progress is / if you're stuck on anything.

@minaremeli (Author)

Hey @StellaAthena! No progress yet, I haven't had time to start working on this change until now. Let me get back to you on Friday.

Comment on lines +58 to +66
```python
results = sample["resps"]
lls = [float(r[0][0]) for r in results]
pred = np.argmax(lls)
pred_norm = np.argmax(lls / completion_len)
if metric == "acc":
    pred_label = choices[pred]
elif metric == "acc_norm":
    pred_label = choices[pred_norm]
predictions.append(pred_label)
```
@minaremeli (Author)

The prediction method (acc or acc_norm) is copied based on this. A drawback of this approach is that if the original method changes (e.g. the normalization constant changes from number of characters to number of tokens), that will not be reflected here.

Comment on lines +47 to +56
```python
if isinstance(target, int):
    target_label = choices[target]
    targets.append(target_label)
elif isinstance(target, str):
    assert target in choices
    targets.append(target)
elif isinstance(target, list):
    raise NotImplementedError(
        "No support yet for multi-label confusion matrix!"
    )
```
@minaremeli (Author)

The target can be specified as int (index of the true label), str (the true label itself), or list (multi-label tasks). I decided not to implement the multi-label case.

@minaremeli (Author)

@haileyschoelkopf @StellaAthena I changed my implementation to the proposed script version. I also left some comments on parts that might require your input. What's your suggestion regarding documentation?

@minaremeli (Author)

Also, I noticed that the sample results files have recently changed from json to jsonl. Any future change to how the samples are logged might break this feature; this is something to keep in mind.
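For what it's worth, reading the jsonl format is straightforward (one JSON object per line), so the coupling to the logging format is mostly limited to the field names the script expects. A hypothetical loader:

```python
import json


def load_samples(path):
    # One logged sample (a JSON object) per line.
    with open(path) as f:
        return [json.loads(line) for line in f]


# e.g. (hypothetical path):
# samples = load_samples("./my-folder/model_name/samples_sst2.jsonl")
```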
