
Standardize metrics #1167

Draft · wants to merge 26 commits into main
Conversation

@lintangsutawika (Contributor) commented Dec 19, 2023

Standardize Metrics

[Resolved review thread on lm_eval/evaluator.py (outdated)]
@baberabb (Contributor)

I don't quite get the distinction between set_wise_compute and aggregation. You said it's equivalent to aggregation, but then why do you have the condition in your class?

@haileyschoelkopf (Contributor)

Hmmm, I think I'd prefer to leave metrics as solo functions rather than objects, and just have, say, accuracy implemented as

    def acc(items):
        return mean([gold == pred for gold, pred in zip(*items)])

What do you think?

This reduces the cognitive load needed for a user to make their own custom metric in a task's utils.py, since they don't need to be deeply acquainted with our library, just with the type signature of the per-doc results, which they'd need to know anyway.

Won’t this class make it harder for us to load arbitrary HF evaluate metrics as desired?

@lintangsutawika (Contributor, Author)

@baberabb yes, it's the same. The condition is there so that we can pass a function like mean and have set_wise_compute call it.

@haileyschoelkopf I get what you mean. I think it might be better to push metrics to where aggregation is done, so the score is calculated dataset-wise. I think this is pretty much the standard for metric functions in libraries like sklearn and evaluate.
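
To illustrate the relationship, a rough sketch only (the exact signatures here are illustrative, not the code in this draft):

    import statistics


    class Metric:
        # Sketch: when an aggregation is given, set_wise_compute scores each
        # item and then aggregates, mirroring the current metric + aggregation
        # split; otherwise the function is treated as dataset-wise and receives
        # the whole list of items at once (sklearn/evaluate style).
        def __init__(self, fn, aggregation=None):
            self.fn = fn
            self.aggregation = aggregation

        def set_wise_compute(self, items):
            if self.aggregation is not None:
                return self.aggregation([self.fn(item) for item in items])
            return self.fn(items)


    # usage sketch: per-item accuracy aggregated with mean
    acc = Metric(lambda item: float(item[0] == item[1]), aggregation=statistics.mean)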

@haileyschoelkopf (Contributor)

> I get what you mean. I think it might be better to push metrics to where aggregation is done, so the score is calculated dataset-wise. I think this is pretty much the standard for metric functions in libraries like sklearn and evaluate.

Agreed, I think this makes sense! (i.e., have each metric perform the previous "aggregation" duty all in one go, with the per-sample calculations bundled into that single function.)

@haileyschoelkopf mentioned this pull request on Dec 30, 2023.
@haileyschoelkopf (Contributor)

I think I'm a little confused about where this PR is currently supposed to leave us in terms of how metrics work. People still provide a metric + an aggregation function, but the actual metric function is now silently more of an aggregation, sometimes?

I think we should maybe go with one of the following:

  • Leave metrics + aggregations as is, but refactor process_results() for multiple choice so that the computations move into their former passthrough functions. Just improve the experience of adding a new metric, the clarity of where things get defined, and the runtime where possible.
  • Abandon the separation of metrics and aggregations entirely, so that we only have one metric function, and call our "metric" functions on the entire eval task a single time. If you think this will negatively impact benchmark aggregations, for example, we can decide not to do this.

What do you think?
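
To make the two options concrete, roughly (names and signatures here are illustrative only, not from the current codebase):

    from statistics import mean

    # Option 1: keep the metric/aggregation split, but the per-doc work that
    # currently lives in process_results moves into the metric function.
    def acc_metric(doc_result):            # per-doc score
        gold, pred = doc_result
        return float(gold == pred)

    def acc_aggregation(scores):           # dataset-level reduction
        return mean(scores)

    # Option 2: a single metric function, called once on the whole eval task.
    def acc(doc_results):
        return mean(float(gold == pred) for gold, pred in doc_results)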

    self._aggregation_list[metric_name] = metric_config[
        "aggregation"
    ]
    assert isinstance(metric_name, str)
Contributor

I think we should try to minimize the amount of metric-handling code in task.py, if possible.

@lintangsutawika (Contributor, Author)

I don't have a strong preference for it to be in task.py, but at the moment I think it's the most relevant place, since the metric is attached to the task through its config. Maybe an intermediate step between loading the config and building the task could work?
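
Something like the following is the kind of intermediate step I have in mind (a hypothetical helper, not code in this PR; the `aggregation_registry` name-to-function mapping and the metric_list keys are assumptions):

    def resolve_metrics(metric_list, aggregation_registry):
        # Hypothetical helper run between loading the config and building the
        # Task: turn each entry of the config's metric_list into callables so
        # task.py itself holds less metric-handling logic.
        aggregations = {}
        for metric_config in metric_list:
            metric_name = metric_config["metric"]
            assert isinstance(metric_name, str)
            aggregations[metric_name] = aggregation_registry[metric_config["aggregation"]]
        return aggregations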

@lintangsutawika (Contributor, Author)

This PR has become mostly a tidying-up of metrics and aggregation. For the metrics part, I ended up making it so that most of the pass-through functions could be removed. The way it works is that, by default, metrics are run at the aggregation level, while staying backwards compatible with the current implementation, where the metric is run in process_results and an aggregation is applied afterwards.

> Leave metrics + aggregations as is, but refactor process_results() for multiple choice so that the computations move into their former passthrough functions. Just improve the experience of adding a new metric, the clarity of where things get defined, and the runtime where possible.

I think this could work, but at the risk of over-abstraction?

> Abandon the separation of metrics and aggregations entirely, so that we only have one metric function, and call our "metric" functions on the entire eval task a single time. If you think this will negatively impact benchmark aggregations, for example, we can decide not to do this.

I think this would be more in line with how most metrics work. But the issue is which parameters we want to pass along from process_results. For multiple_choice, would the step of taking the max log-likelihood as the prediction be computed in the metric or in process_results?
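
To make the multiple_choice question concrete, the two placements would look roughly like this (illustrative sketch; names and signatures are not from the codebase):

    import numpy as np

    # (a) the argmax stays in process_results; the metric only sees (gold, pred)
    def process_results_argmax(doc, lls):
        return doc["gold"], int(np.argmax(lls))

    # (b) process_results passes the raw loglikelihoods through and the
    #     dataset-level metric takes the argmax itself
    def acc_from_loglikelihoods(items):
        # each item assumed to be (gold_index, loglikelihoods_per_choice)
        return np.mean([gold == int(np.argmax(lls)) for gold, lls in items])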

Linked issue (may be closed by this PR): always get acc, acc_norm, perplexity = 1 on triviaqa task based on llama2 model