Cluster bootstrap for metrics; refactor metric computations into evaluate_preds #197

norabelrose · 2023-04-17T20:12:47Z

I realized that our code for computing AUROC, accuracy, and calibrated accuracy was scattered all over the place, with a decent amount of duplication. This PR refactors all of it into a single function, evaluate_preds, which is used for both reporters and logistic regression classifiers, in elicit as well as eval.

Other changes:

1. Confidence intervals now use the cluster bootstrap, resampling entire groups of prompt templates at a time, to account for the fact that different variants of the same data point are not IID. This leads to significantly larger CIs than those reported in main.
2. Partially in order to account for (1), I've increased the default for max_examples from [750, 250] to [1000, 1000]. There's just too much noise in the data with only 250 clusters.
3. Confidence intervals are now included for accuracy and calibrated accuracy.
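
For readers unfamiliar with the cluster bootstrap, here is a minimal sketch of the idea applied to accuracy. It is not the code in this PR: the function names `cluster_bootstrap_ci` and `naive_bootstrap_ci`, the `(n_clusters, n_variants)` array layout, and the percentile interval are all assumptions made for illustration. Each row stands for one data point and each column for one prompt-template variant of it.

```python
import numpy as np


def cluster_bootstrap_ci(correct, num_boot=1000, level=0.95, seed=0):
    """Percentile CI for accuracy, resampling whole clusters (rows).

    `correct` is a hypothetical (n_clusters, n_variants) array of 0/1 flags,
    one row per data point and one column per prompt-template variant.
    """
    correct = np.asarray(correct, dtype=float)
    n_clusters = correct.shape[0]
    rng = np.random.default_rng(seed)

    # Draw cluster indices with replacement; every variant of a drawn
    # cluster comes along with it, which preserves within-cluster correlation.
    idx = rng.integers(0, n_clusters, size=(num_boot, n_clusters))
    estimates = correct[idx].mean(axis=(1, 2))

    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha])
    return correct.mean(), lower, upper


def naive_bootstrap_ci(correct, num_boot=1000, level=0.95, seed=0):
    """Same interval, but (incorrectly) treating every variant as IID."""
    flat = np.asarray(correct, dtype=float).ravel()
    rng = np.random.default_rng(seed)

    # Resample individual predictions, ignoring which cluster they came from.
    idx = rng.integers(0, flat.size, size=(num_boot, flat.size))
    estimates = flat[idx].mean(axis=1)

    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha])
    return flat.mean(), lower, upper
```

When variants of the same data point tend to be right or wrong together, the naive version overstates the effective sample size, so its interval comes out narrower than the cluster version. That is the effect behind the larger CIs mentioned in (1).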

AlexTMallen

LGTM

norabelrose added 3 commits April 17, 2023 20:10

Refactor metrics into evaluate_preds

6eb18a4

Fix stupid CCS bug

9f5759b

Cluster bootstrap for AUROC; boost default sample size

14d1323

norabelrose changed the title ~~Refactor metric computations into evaluate_preds~~ Cluster bootstrap for AUROC; refactor metric computations into evaluate_preds Apr 17, 2023

Cluster bootstrap for accuracy

bc3f29c

norabelrose changed the title ~~Cluster bootstrap for AUROC; refactor metric computations into evaluate_preds~~ Cluster bootstrap for metrics; refactor metric computations into evaluate_preds Apr 17, 2023

norabelrose requested review from AlexTMallen and lauritowal April 17, 2023 22:37

Allow for arbitrary hparam selection in sweep

b829538

This was referenced Apr 17, 2023

Fix acc for supervised (in the same way as #195) #196

Closed

Allow for arbitrary hyperparameter selection for sweep #198

Closed

norabelrose added 8 commits April 18, 2023 00:00

Don't normalize LM probs twice

8e7dfff

Merge branch 'main' into metric-refactor

83716dc

Merge branch 'main' into metric-refactor

2a2c319

Merge branch 'main' into metric-refactor

6358b83

Merge branch 'main' into metric-refactor

f7fa3b1

Fix normalization of LM logits

d625f7b

Merge branch 'sweep-extra' into metric-refactor

2dc6ec2

Merge branch 'main' into metric-refactor

42adb2e

AlexTMallen approved these changes Apr 19, 2023


norabelrose merged commit 4d65f9c into main Apr 19, 2023
4 checks passed

norabelrose deleted the metric-refactor branch April 19, 2023 19:14
