Group agg rework #1741

Open
wants to merge 84 commits into
base: main
Changes from 1 commit
2a2566e
add greoup_config arg
lintangsutawika Apr 23, 2024
6da6d18
add a group config that allows disabling table for group score and gr…
lintangsutawika Apr 23, 2024
5a98162
fixed size configuration
lintangsutawika Apr 24, 2024
cd0ad1b
adjust config
lintangsutawika Apr 24, 2024
eb9c6a5
add group config
lintangsutawika Apr 24, 2024
4dd6906
adjust mmlu to use group_config
lintangsutawika Apr 24, 2024
9551bbf
fixed args input in aggregate_subtask_metrics
lintangsutawika Apr 25, 2024
bcc887a
fixed issues related to printing alias of group and updated yaml
lintangsutawika Apr 25, 2024
939a0cb
Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…
lintangsutawika Apr 25, 2024
9ccec64
update all mmlu variants to include group_config
lintangsutawika Apr 25, 2024
0579b30
edit format
lintangsutawika Apr 25, 2024
9633619
modify mmlu tasks
lintangsutawika May 6, 2024
3473e19
adjust group to also be a configurable group
lintangsutawika May 7, 2024
86039e8
add configurable group
lintangsutawika May 7, 2024
fe2a147
simplify get_task_list
lintangsutawika May 7, 2024
62572f0
adjust group scoring with using ConfigurableGroup
lintangsutawika May 7, 2024
09bc7c6
adjust args
lintangsutawika May 7, 2024
cb085b0
update mmlu
lintangsutawika May 7, 2024
c23c930
update mmlu
lintangsutawika May 7, 2024
ad70d20
update to work with new group and task configuration
lintangsutawika May 7, 2024
03982e0
readd group_agg
lintangsutawika May 7, 2024
75dfac4
readd files
lintangsutawika May 7, 2024
110e5a2
move prepare_print_tasks to evaluator_utils
lintangsutawika May 7, 2024
8d59330
resolved merge conflict
lintangsutawika May 7, 2024
9f698c2
sort set to False by default, fix predict_only arg
lintangsutawika May 7, 2024
2a81753
add version for groups
lintangsutawika May 7, 2024
5637397
reversed task list
lintangsutawika May 7, 2024
44d7039
update additional condition when loading a group in a group yaml
lintangsutawika May 8, 2024
9f06432
update truthfulqa
lintangsutawika May 8, 2024
3f770bb
add description regarding tags replacing group
lintangsutawika May 8, 2024
2f2322b
replace group to tag
lintangsutawika May 8, 2024
c6b8132
Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-e…
lintangsutawika May 8, 2024
a88886a
fixed conditional statement
lintangsutawika May 8, 2024
1fae728
remove warning
lintangsutawika May 10, 2024
f36fb47
update loading of task group and newly added tags
lintangsutawika May 10, 2024
1320394
reformat with pre-commit
lintangsutawika May 10, 2024
7350958
fixed info log
lintangsutawika May 10, 2024
719fa9b
update
lintangsutawika May 10, 2024
5c289f9
fix bug
lintangsutawika May 10, 2024
5a3a957
fix bug
lintangsutawika May 10, 2024
2f3d427
use task id to differentiate tasks
lintangsutawika May 10, 2024
e098647
convert all groups to configurable groups
lintangsutawika May 10, 2024
39c4027
use task_id
lintangsutawika May 10, 2024
0905615
reformat
lintangsutawika May 10, 2024
78c3f7d
add task_id for python tasks as well
lintangsutawika May 11, 2024
1ef5b0b
add task_id for python tasks as well
lintangsutawika May 11, 2024
f4d2e6e
add task_id for python tasks as well
lintangsutawika May 11, 2024
3924608
revert truthfulqa
lintangsutawika May 11, 2024
0ffdee0
revert mmlu tasks
lintangsutawika May 11, 2024
1960eb9
new mmlu config
lintangsutawika May 11, 2024
41e64b2
new group config parameter `tag_to_task`
lintangsutawika May 11, 2024
79ce346
Update truthfulqa_mc2.yaml
lintangsutawika May 11, 2024
aef7028
reformate
lintangsutawika May 11, 2024
6d1753d
add _process_group_config
lintangsutawika May 15, 2024
c90655d
adjust task_id
lintangsutawika May 15, 2024
88fea8a
add get_subtask_list function to get proper subtask list
lintangsutawika May 16, 2024
e66c5f5
group config to_dict update
lintangsutawika May 16, 2024
923f3e8
remove tag check
lintangsutawika May 16, 2024
104292f
update mmlu
lintangsutawika May 16, 2024
3be0916
fix config passing issues
lintangsutawika May 17, 2024
fd9cd80
add test yaml
lintangsutawika May 17, 2024
3e1301b
resolved merge conflict from latest version
lintangsutawika Jun 4, 2024
4eeb871
format fix
lintangsutawika Jun 4, 2024
be8b547
add documentation
lintangsutawika Jun 6, 2024
4140dd9
corner case for single tag being called
lintangsutawika Jun 6, 2024
5032eba
fix indentation
lintangsutawika Jun 7, 2024
e6b1581
formatting
lintangsutawika Jun 7, 2024
9a30374
update all mmlu variants
lintangsutawika Jun 7, 2024
ed1f757
Update docs/task_guide.md
lintangsutawika Jun 10, 2024
e8f4918
remove group_alias
lintangsutawika Jun 10, 2024
016615f
Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-e…
lintangsutawika Jun 10, 2024
26578d2
Update docs/task_guide.md
lintangsutawika Jun 10, 2024
1848d66
remove version for metadata
lintangsutawika Jun 10, 2024
b002849
Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-e…
lintangsutawika Jun 10, 2024
9e940f3
Update docs/task_guide.md
lintangsutawika Jun 10, 2024
ac1a1ce
update mmlu/
lintangsutawika Jun 10, 2024
e184c50
removed " " in make_table
lintangsutawika Jun 10, 2024
0f095f7
Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-e…
lintangsutawika Jun 10, 2024
80d0f41
change how aggregate_metric is loaded
lintangsutawika Jun 10, 2024
9fa3b3f
change how aggregate_metric is loaded
lintangsutawika Jun 10, 2024
5b64fb5
update aggregate_metric arg
lintangsutawika Jun 10, 2024
83c070d
update format
lintangsutawika Jun 10, 2024
5b527a7
update format
lintangsutawika Jun 10, 2024
4376566
update task_id to be updateable and uses group:task format
lintangsutawika Jun 27, 2024
use task id to differentiate tasks
lintangsutawika committed May 10, 2024
commit 2f3d42723bef416a6751b821a2f8ee3bf0738acf
38 changes: 24 additions & 14 deletions lm_eval/evaluator_utils.py
@@ -5,7 +5,7 @@
 from typing import List, Optional, Tuple, Union

 from lm_eval.api import metrics
-from lm_eval.tasks import ConfigurableGroup
+from lm_eval.tasks import ConfigurableGroup, ConfigurableTask
 from lm_eval.utils import eval_logger, positional_deprecated


@@ -40,6 +40,7 @@ def __init__(
         self,
         task=None,
         task_name=None,
+        task_id=None,
         task_config=None,
         version=None,
         group_name=None,
@@ -51,6 +52,7 @@ def __init__(
         self.task = task
         self.task_config = task_config
         self.task_name = task_name
+        self.task_id = task_id
         self.group_name = group_name
         self.version = version
         self.n_shot = n_shot
@@ -76,6 +78,7 @@ def from_taskdict(cls, task_name: str, task):
                 task=task, task_name=task_name, is_group=is_group, group_name=group_name
             )
         version = task.VERSION
+        task_id = task.task_id
         task_config = dict(task.dump_config())
         if (n_shot := task_config.get("num_fewshot")) == 0:
             n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
@@ -84,6 +87,7 @@ def from_taskdict(cls, task_name: str, task):
         return cls(
             task=task,
             task_name=task_name,
+            task_id=task_id,
             task_config=task_config,
             group_name=group_name,
             version=version,
@@ -113,9 +117,10 @@ def __repr__(self):
         return (
             f"TaskOutput(task_name={self.task_name}, "
             f"group_name={self.group_name}, "
-            f"version={self.version},"
-            f"n_shot={self.n_shot}"
-            f"task_alias={self.task_alias}, group_alias={self.group_alias})"
+            f"version={self.version}, "
+            f"n_shot={self.n_shot}, "
+            f"task_alias={self.task_alias}, "
+            f"group_alias={self.group_alias})"
         )


@@ -176,18 +181,21 @@ def prepare_print_tasks(
     for task_or_group_name, task_or_group_obj in task_dict.items():
         tab_string = " " * task_depth + "- " if task_depth > 0 else ""
         if isinstance(task_or_group_name, ConfigurableGroup):
-            name = task_or_group_name.group
+            # name = task_or_group_name.group
+            name = task_or_group_name.task_id
             from_configurable_group = True
         elif isinstance(task_or_group_name, str):
             name = task_or_group_name
+            if isinstance(task_or_group_obj, ConfigurableTask):
+                name = task_or_group_obj.task_id
             from_configurable_group = False

         task_agg[name] = results[name].copy()
         if from_configurable_group:
             if task_or_group_name.group_alias is not None:
                 alias = task_or_group_name.group_alias
             else:
-                alias = name
+                alias = task_or_group_name.group
         else:
             if "alias" in task_agg[name]:
                 alias = task_agg[name]["alias"]
@@ -255,21 +263,23 @@ def consolidate_results(
     versions = collections.defaultdict(dict)
     for task_output in eval_tasks:
         if "task_alias" in (task_config := task_output.task_config):
-            results[task_output.task_name]["alias"] = task_config["task_alias"]
+            results[task_output.task_id]["alias"] = task_config["task_alias"]
+        else:
+            results[task_output.task_id]["alias"] = task_output.task_name
         if group_alias := task_output.group_alias:
             if group_alias not in results and (group_name := task_output.group_name):
                 results[group_name]["alias"] = group_alias
-        num_fewshot[task_output.task_name] = task_output.n_shot
-        configs[task_output.task_name] = task_output.task_config
-        versions[task_output.task_name] = task_output.version
-        samples[task_output.task_name] = task_output.logged_samples
+        num_fewshot[task_output.task_id] = task_output.n_shot
+        configs[task_output.task_id] = task_output.task_config
+        versions[task_output.task_id] = task_output.version
+        samples[task_output.task_id] = task_output.logged_samples
         for (metric, filter_key), items in task_output.sample_metrics.items():
             metric_key = f"{metric},{filter_key}"
-            results[task_output.task_name][metric_key] = task_output.agg_metrics[
+            results[task_output.task_id][metric_key] = task_output.agg_metrics[
                 metric_key
             ]
-            results[task_output.task_name]["samples"] = task_output.sample_len
-            results[task_output.task_name][
+            results[task_output.task_id]["samples"] = task_output.sample_len
+            results[task_output.task_id][
                 f"{metric}_stderr,{filter_key}"
             ] = task_output.agg_metrics[f"{metric}_stderr,{filter_key}"]
     return results, samples, configs, versions, num_fewshot
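
The last hunk switches the per-task dictionaries in `consolidate_results` from `task_name` keys to `task_id` keys. Below is a minimal sketch of why that can matter, assuming the same task may be included by more than one group. `FakeTaskOutput`, the task and group names, and the `group:task` id strings are hypothetical stand-ins for illustration (the id format is borrowed from a later commit message in this PR), not the harness's real objects.

```python
import collections
from dataclasses import dataclass


@dataclass
class FakeTaskOutput:
    """Hypothetical stand-in for TaskOutput; only the fields used here."""

    task_name: str
    task_id: str
    n_shot: int


def collect_n_shot(outputs, key_attr):
    """Build a {key: n_shot} mapping, keyed by the attribute named key_attr."""
    num_fewshot = collections.defaultdict(int)
    for out in outputs:
        num_fewshot[getattr(out, key_attr)] = out.n_shot
    return dict(num_fewshot)


# The same (hypothetical) task included by two groups with different settings.
outputs = [
    FakeTaskOutput(task_name="demo_task", task_id="group_a:demo_task", n_shot=0),
    FakeTaskOutput(task_name="demo_task", task_id="group_b:demo_task", n_shot=5),
]

# Keyed by name: the second entry silently overwrites the first.
by_name = collect_n_shot(outputs, "task_name")
# Keyed by id: both entries are kept.
by_id = collect_n_shot(outputs, "task_id")

print(by_name)  # {'demo_task': 5}
print(by_id)    # {'group_a:demo_task': 0, 'group_b:demo_task': 5}
```

With name keys only one of the two runs is reported. With id keys each group's copy of the task keeps its own few-shot count, and presumably the same holds for the `results`, `configs`, `versions` and `samples` mappings touched in the hunk.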