
Group agg rework #1741

Open · wants to merge 84 commits into main
Conversation

@lintangsutawika (Contributor) commented Apr 23, 2024

  1. By default, a group will not report an aggregate score over its subtasks.
  2. To show an aggregate, a group_config must be defined in the YAML, consisting of aggregate_metric (True/False, default False) and weight_by_size (True/False, default False).
  3. task_id is used in ConfigurableGroup and ConfigurableTask as the identifier, in lieu of the task name/group name.
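A minimal sketch of what such a group YAML might look like under the scheme described above. The group_config field names (aggregate_metric, weight_by_size) come from the description; the group and subtask names here are hypothetical placeholders, not tasks from the repository:

```yaml
# Hypothetical group YAML illustrating the group_config described above.
group: my_benchmark        # hypothetical group name
task:
  - my_subtask_a           # hypothetical subtasks
  - my_subtask_b
group_config:
  aggregate_metric: true   # opt in to an aggregate score (default: false)
  weight_by_size: true     # weight subtasks by dataset size (default: false)
```

With both flags left at their defaults, the group would list its subtasks' scores without any aggregate row.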

@lintangsutawika lintangsutawika marked this pull request as ready for review April 25, 2024 18:05
@lintangsutawika (Contributor, Author)

@haileyschoelkopf I've only added the group_config to MMLU tasks and flan_held_in. Let me know which other benchmarks need the aggregation added back in.

Review threads (outdated, resolved): lm_eval/evaluator.py; lm_eval/api/task.py (two threads)
@haileyschoelkopf (Contributor) left a comment:
Very close to done!

Left some comments, and messaged you re: how we should perhaps change the GroupConfig interface

Other than that:

  • We should go through and determine which of the existing groups should aggregate and which should not. I can do this.
  • I need to go through the evaluator.py changes again more finely; there's quite a lot of logic in there at this point that I want to look at more closely.
  • We should couple this with a README update mentioning this change and referring people to Discord and the docs files to better understand what's affected.

Will try to aim for having this merged by end of next week. Sorry for all the delays!

Review thread (outdated, resolved): lm_eval/utils.py
@@ -242,7 +242,7 @@ def get_original(self, newarr):
     return res


-def make_table(result_dict, column: str = "results", sort_results: bool = True):
+def make_table(result_dict, column: str = "results", sort_results: bool = False):
Review comment (Contributor):
reason for not sorting by default?

Reply (Contributor, Author):
It messes up the grouping: the group name rows also get sorted along with everything else.

Review threads:
  • lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu.yaml (outdated, resolved)
  • docs/task_guide.md (five threads, resolved; three outdated)
  • lm_eval/api/task.py (outdated, resolved)
@@ -0,0 +1,45 @@
+group: test-1
Review comment (Contributor):

Is there an associated test for this? Note also that #1916 should help with PRs like this.
