Standardize metrics #1167

Status: Draft. lintangsutawika wants to merge 26 commits from standardize_metrics into main.

Changes shown below are from 1 commit of the 26.
All 26 commits are by lintangsutawika.

e7cd7d6 (Dec 19, 2023)  sample metrics that have both sample-wise and set-wise operations
1d262a5 (Dec 19, 2023)  change how metrics are registered
028f04c (Dec 19, 2023)  loglikelihood and loglikelihood rolling modified
6117c50 (Dec 19, 2023)  changed how metrics are calculated
a808c66 (Dec 19, 2023)  Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…
c6a9158 (Dec 27, 2023)  update
4d49dd0 (Dec 28, 2023)  aggregation to compute_metric
9d6bc92 (Dec 28, 2023)  aggregation to compute_metric
3888193 (Dec 28, 2023)  simplify registry
039832e (Dec 28, 2023)  removed passthrough fn
e5b245c (Dec 28, 2023)  remove aggregation
20c10df (Dec 28, 2023)  kwargs are added to metric_fn through partial at the beginning (see the sketch after this list)
6a336b1 (Dec 28, 2023)  use HFEvaluateAdaptor for hf metrics
150f11f (Dec 28, 2023)  revert to just load metric_fn
99ce4ef (Dec 28, 2023)  process hf evaluate metrics
439dca5 (Dec 29, 2023)  list tuple for string based multigpu collection
aaf64aa (Jan 2, 2024)   readded suport for aggregation
787b23f (Jan 2, 2024)   readd aggregation
703e0d5 (Jan 2, 2024)   adjusted aggregation config
2a573a1 (Jan 2, 2024)   adjust to be backwards compatible
2054c2e (Jan 2, 2024)   revert
dfb4183 (Jan 2, 2024)   revert
cda25fe (Jan 2, 2024)   Merge branch 'main' into standardize_metrics
470fb31 (Jan 2, 2024)   resolved git conflict
dfb036b (Jan 2, 2024)   resolved again
de46fb9 (Jan 2, 2024)   reformat
Commit 787b23f6997a11d02252d386062c1314098b315e: readd aggregation
Committed by lintangsutawika on Jan 2, 2024
lm_eval/evaluator.py — 7 changed lines: 3 additions & 4 deletions

@@ -449,16 +449,15 @@ def evaluate(
             else:
                 group_name = None

-            metric_fn = task.compute_metric()[metric]
-            results[task_name][metric_key] = metric_fn(items)
+            agg_fn = task.aggregation()[metric]
+            results[task_name][metric_key] = agg_fn(items)
             results[task_name]["samples"] = len(items)

             # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
             # so we run them less iterations. still looking for a cleaner way to do this
             if bootstrap_iters > 0:
                 stderr = lm_eval.api.metrics.stderr_for_metric(
-                    # metric=task.aggregation()[metric],
-                    metric=task.compute_metric()[metric],
+                    metric=task.aggregation()[metric],
                     bootstrap_iters=min(bootstrap_iters, 100)
                     if metric in ["bleu", "chrf", "ter"]
                     else bootstrap_iters,
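For readers outside the codebase: this hunk swaps task.compute_metric() back for task.aggregation(), a mapping from metric names to aggregation callables, and feeds the same callable to stderr_for_metric for bootstrapped error bars. A minimal sketch of that bootstrap logic, assuming agg_fn and items as in the diff; this is not the harness's actual stderr_for_metric implementation:

import random
import statistics


def bootstrap_stderr(agg_fn, items, iters):
    # Estimate the stderr of agg_fn(items) by resampling with replacement
    # and taking the standard deviation of the resampled estimates.
    estimates = []
    for _ in range(iters):
        resample = random.choices(items, k=len(items))
        estimates.append(agg_fn(resample))
    return statistics.stdev(estimates)


agg_fn = statistics.mean      # stands in for task.aggregation()["acc"]
items = [1, 0, 1, 1, 0, 1]    # per-sample scores collected into `items`
metric = "acc"

# Corpus-level metrics (bleu, chrf, ter) recompute over the whole resample
# each iteration, which is expensive; hence the cap at 100 iterations above.
bootstrap_iters = 1000
iters = min(bootstrap_iters, 100) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters
print(bootstrap_stderr(agg_fn, items, iters))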