
Standardize metrics #1167

Draft · wants to merge 26 commits into main
Conversation

@lintangsutawika (Contributor) commented Dec 19, 2023

Standardize Metrics

[Resolved review thread on lm_eval/evaluator.py (outdated)]
@baberabb (Contributor)

I don't quite get the distinction between set_wise_compute and aggregation. You said it's equivalent to aggregation, but then why do you have the condition in your class?

@haileyschoelkopf (Contributor)

Hmmm, I think I'd prefer to leave metrics as solo functions rather than objects, and just have, say, accuracy implemented as

    def acc(items):
        return mean([gold == pred for gold, pred in zip(*items)])

What do you think?

This reduces the cognitive load needed for a user to make their own custom metric in a task's utils.py, since they don't need to be deeply acquainted with our library, just with the type signature of the per-doc results, which they'd need to know anyway.

Won’t this class make it harder for us to load arbitrary HF evaluate metrics as desired?

@lintangsutawika (Contributor, Author)

@baberabb yes, it's the same. The condition is there so that we can pass a function like mean and have set_wise_compute call it.

@haileyschoelkopf I get what you mean. I think it might be better to push metrics to where aggregation is done, so the score is calculated dataset-wise. I think this is pretty much the standard for metric functions in libraries like sklearn and evaluate.
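
To illustrate the relationship, a rough sketch only (the exact signatures here are illustrative, not the code in this draft):

    import statistics


    class Metric:
        # Sketch: when an aggregation is given, set_wise_compute scores each
        # item and then aggregates, mirroring the current metric + aggregation
        # split; otherwise the function is treated as dataset-wise and receives
        # the whole list of items at once (sklearn/evaluate style).
        def __init__(self, fn, aggregation=None):
            self.fn = fn
            self.aggregation = aggregation

        def set_wise_compute(self, items):
            if self.aggregation is not None:
                return self.aggregation([self.fn(item) for item in items])
            return self.fn(items)


    # usage sketch: per-item accuracy aggregated with mean
    acc = Metric(lambda item: float(item[0] == item[1]), aggregation=statistics.mean)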

@haileyschoelkopf (Contributor)

> I get what you mean. I think it might be better to push metrics to where aggregation is done, so the score is calculated dataset-wise. I think this is pretty much the standard for metric functions in libraries like sklearn and evaluate.

Agreed, I think this makes sense! (i.e., have each metric perform the previous "aggregation" duty all in one go, with the per-sample calculations bundled into that single function.)

@haileyschoelkopf mentioned this pull request on Dec 30, 2023.
@haileyschoelkopf (Contributor)

I think I'm a little confused about where this PR is currently supposed to leave us in terms of how metrics work. People still provide a metric + an aggregation function, but the actual metric function is now silently more of an aggregation, sometimes?

I think we should maybe go with one of the following:

  • Leave metrics + aggregations as is, but refactor process_results() for multiple choice so that the computations move into their former passthrough functions. Just improve the experience of adding a new metric, the clarity of where things get defined, and the runtime where possible.
  • Abandon the separation of metrics and aggregations entirely, so that we only have one metric function, and call our "metric" functions on the entire eval task a single time. If you think this will negatively impact benchmark aggregations, for example, we can decide not to do this.

What do you think?
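
To make the two options concrete, roughly (names and signatures here are illustrative only, not from the current codebase):

    from statistics import mean

    # Option 1: keep the metric/aggregation split, but the per-doc work that
    # currently lives in process_results moves into the metric function.
    def acc_metric(doc_result):            # per-doc score
        gold, pred = doc_result
        return float(gold == pred)

    def acc_aggregation(scores):           # dataset-level reduction
        return mean(scores)

    # Option 2: a single metric function, called once on the whole eval task.
    def acc(doc_results):
        return mean(float(gold == pred) for gold, pred in doc_results)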

    self._aggregation_list[metric_name] = metric_config[
        "aggregation"
    ]
    assert isinstance(metric_name, str)
Contributor

I think we should try to minimize the amount of metric-handling code in task.py, if possible.

@lintangsutawika (Contributor, Author)

I don't have a strong preference for it to be in task.py, but at the moment I think it's the most relevant place, since the metric is attached to the task through its config. Maybe an intermediate step between loading the config and building the task could work?
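
Something like the following is the kind of intermediate step I have in mind (a hypothetical helper, not code in this PR; the `aggregation_registry` name-to-function mapping and the metric_list keys are assumptions):

    def resolve_metrics(metric_list, aggregation_registry):
        # Hypothetical helper run between loading the config and building the
        # Task: turn each entry of the config's metric_list into callables so
        # task.py itself holds less metric-handling logic.
        aggregations = {}
        for metric_config in metric_list:
            metric_name = metric_config["metric"]
            assert isinstance(metric_name, str)
            aggregations[metric_name] = aggregation_registry[metric_config["aggregation"]]
        return aggregations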

@lintangsutawika (Contributor, Author)

This PR has become mostly a tidying-up of metrics and aggregation. For the metrics part, I ended up making it so that most of the pass-through functions could be removed. The way it works is that, by default, metrics are run at the aggregation level, while staying backwards compatible with the current implementation, where the metric is run in process_results and an aggregation is applied afterwards.

> Leave metrics + aggregations as is, but refactor process_results() for multiple choice so that the computations move into their former passthrough functions. Just improve the experience of adding a new metric, the clarity of where things get defined, and the runtime where possible.

I think this could work, but at the risk of over-abstraction?

> Abandon the separation of metrics and aggregations entirely, so that we only have one metric function, and call our "metric" functions on the entire eval task a single time. If you think this will negatively impact benchmark aggregations, for example, we can decide not to do this.

I think this would be more in line with how most metrics work. But the issue is which parameters we want to pass along from process_results. For multiple_choice, would the step of taking the max log-likelihood as the prediction be computed in the metric or in process_results?
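
To make the multiple_choice question concrete, the two placements would look roughly like this (illustrative sketch; names and signatures are not from the codebase):

    import numpy as np

    # (a) the argmax stays in process_results; the metric only sees (gold, pred)
    def process_results_argmax(doc, lls):
        return doc["gold"], int(np.argmax(lls))

    # (b) process_results passes the raw loglikelihoods through and the
    #     dataset-level metric takes the argmax itself
    def acc_from_loglikelihoods(items):
        # each item assumed to be (gold_index, loglikelihoods_per_choice)
        return np.mean([gold == int(np.argmax(lls)) for gold, lls in items])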

Linked issue (may be closed by this PR): always get acc, acc_norm, perplexity = 1 on triviaqa task based on llama2 model