
Easier unitxt tasks loading and removal of unitxt library dependency #1933

Merged
merged 9 commits into from
Jul 8, 2024

Conversation

elronbandel
Contributor

Easier loading for our lm-eval-harness users. This is significant for our user base and can help bring more users to the lm-eval-harness platform.

Signed-off-by: Elron Bandel <[email protected]>
@CLAassistant

CLAassistant commented Jun 6, 2024

CLA assistant check
All committers have signed the CLA.

@elronbandel elronbandel changed the title Updated unitxt loading Easier unitxt tasks loading and removal of unitxt library dependency Jun 6, 2024
Contributor

@haileyschoelkopf haileyschoelkopf left a comment


Hi @elronbandel , thank you for this PR!

Just two small points before we can merge:

  • I made a comment about naming conventions for unitxt tasks: which do you prefer?
  • +1 to testing the small change in __init__.py on some of the other Python tasks in the repo if possible: squad_completion, fda, and swde would be nice tasks to test this on.

@elronbandel
Contributor Author

Thank you for reviewing my PR so quickly @haileyschoelkopf @LSinev!

  1. I adjusted every use of ConfigurableTask to accept the config.

  2. The following command worked: lm_eval --model hf --model_args pretrained=google/flan-t5-small --tasks fda,squad_completion,squadv2,swde --device cpu --limit 10
    The output is as expected:

| Tasks            | Version | Filter | n-shot | Metric       | Value | Stderr |
|------------------|---------|--------|--------|--------------|-------|--------|
| fda              | 0       | none   | 0      | contains     | 0.0   | ± N/A  |
| squad_completion | 0       | none   | 0      | contains     | 0.2   | ± N/A  |
| squadv2          | 3       | none   | 0      | exact        | 60.0  | ± N/A  |
|                  |         | none   | 0      | f1           | 60.0  | ± N/A  |
|                  |         | none   | 0      | HasAns_exact | 100.0 | ± N/A  |
|                  |         | none   | 0      | HasAns_f1    | 100.0 | ± N/A  |
|                  |         | none   | 0      | NoAns_exact  | 0.0   | ± N/A  |
|                  |         | none   | 0      | NoAns_f1     | 0.0   | ± N/A  |
|                  |         | none   | 0      | best_exact   | 70.0  | ± N/A  |
|                  |         | none   | 0      | best_f1      | 70.0  | ± N/A  |
| swde             | 0       | none   | 0      | contains     | 0.3   | ± N/A  |

@LSinev
Contributor

LSinev commented Jun 9, 2024

I believe one should proactively find every task affected by this small change in __init__.py, then check and update them accordingly.
For example, one could search for Python tasks containing the string class: !function, then find every class that may be affected: not only ConfigurableTask descendants like (but not limited to) https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/squadv2/task.py#L49, but also Task descendants like (but not limited to) https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/scrolls/task.py#L111.

@elronbandel
Contributor Author

elronbandel commented Jun 11, 2024

@LSinev Do you think it might be safer to use this function:

import inspect

def has_config_in_constructor(cls):
    """Determine whether the constructor of the provided class accepts a parameter named 'config'.

    Args:
        cls (type): The class to be inspected.

    Returns:
        bool: True if a 'config' parameter exists in the constructor, False otherwise.
    """
    constructor = getattr(cls, "__init__", None)
    return "config" in inspect.signature(constructor).parameters if constructor else False

Then conditioning on it:

if has_config_in_constructor(config["class"]):
    task = config["class"](config=config)
else:
    task = config["class"]()
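For illustration, this dispatch can be exercised against two toy task classes (LegacyTask and ConfigAwareTask are hypothetical names for this sketch, not classes from the harness):

```python
import inspect

def has_config_in_constructor(cls):
    """Return True if cls.__init__ accepts a parameter named 'config'."""
    constructor = getattr(cls, "__init__", None)
    return "config" in inspect.signature(constructor).parameters if constructor else False

class LegacyTask:
    # Old-style task: constructor takes no config argument.
    def __init__(self):
        self.config = None

class ConfigAwareTask:
    # New-style task: constructor accepts a config dict.
    def __init__(self, config=None):
        self.config = config

def build_task(config):
    # Same conditional dispatch as proposed above: pass config only
    # when the class's constructor can accept it.
    if has_config_in_constructor(config["class"]):
        return config["class"](config=config)
    return config["class"]()

old = build_task({"class": LegacyTask})       # constructed without config
new = build_task({"class": ConfigAwareTask})  # receives the config dict
```

Both styles of task class go through the same code path, which is what makes the change backward compatible.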

@LSinev
Contributor

LSinev commented Jun 13, 2024

This has_config_in_constructor check could also be a good basis for a new test.
It also preserves compatibility with Python tasks already written in other repos and included in benchmarking via the --include_path option; they will not fail after the update.

The idea behind updating all Python tasks in this repo accordingly is not only a fully working repo, but also providing useful, working examples for new task creators. With the has_config_in_constructor check there will be two supported options: with or without config. So a thorough documentation update is also needed, since the current guidance of "do it the way the provided example does" in https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#configuring-python-classes is no longer the only way.

That said, easing unitxt task loading may ripple into many changes across the repo; alternatively, it may be simpler to convert unitxt to a ConfigurableTask and improve the config options through that class.

@elronbandel
Contributor Author

elronbandel commented Jul 3, 2024

Current change:

  • Fix any task using "class: !function"
  • Update documentation
  • Add backward compatibility mechanism

Details:

Fixing tasks using "class: !function"

By running grep -rl 'class: \!function' ./ | xargs -n 1 basename | cut -d. -f1

we get everything that needs to be updated:

unitxt
fda
scrolls_qmsum
scrolls_quality
scrolls_narrativeqa
scrolls_qasper
scrolls_summscreenfd
scrolls_govreport
scrolls_contractnli
squadv2
swde
squad_completion

All the classes can be tested end to end with: lm_eval --model hf --model_args pretrained=google/flan-t5-small --tasks fda,scrolls_qmsum,squadv2,swde,squad_completion --device cpu --limit 10
Output:

| Tasks            | Version | Filter | n-shot | Metric       | Value    | Stderr |
|------------------|---------|--------|--------|--------------|----------|--------|
| fda              | 0       | none   | 0      | contains     | 0.0000   | ± N/A  |
| scrolls_qmsum    | 2       | none   | 0      | rouge1       | 7.5993   | ± N/A  |
|                  |         | none   | 0      | rouge2       | 0.4706   | ± N/A  |
|                  |         | none   | 0      | rougeL       | 5.9726   | ± N/A  |
| squad_completion | 0       | none   | 0      | contains     | 0.2000   | ± N/A  |
| squadv2          | 3       | none   | 0      | HasAns_exact | 100.0000 | ± N/A  |
|                  |         | none   | 0      | HasAns_f1    | 100.0000 | ± N/A  |
|                  |         | none   | 0      | NoAns_exact  | 0.0000   | ± N/A  |
|                  |         | none   | 0      | NoAns_f1     | 0.0000   | ± N/A  |
|                  |         | none   | 0      | best_exact   | 70.0000  | ± N/A  |
|                  |         | none   | 0      | best_f1      | 70.0000  | ± N/A  |
|                  |         | none   | 0      | exact        | 60.0000  | ± N/A  |
|                  |         | none   | 0      | f1           | 60.0000  | ± N/A  |
| swde             | 0       | none   | 0      | contains     | 0.3000   | ± N/A  |

Similarly, when I revert the change in squad_completion as an example, I still get the same result:

| Tasks            | Version | Filter | n-shot | Metric   | Value | Stderr |
|------------------|---------|--------|--------|----------|-------|--------|
| squad_completion | 0       | none   | 0      | contains | 0.2   | ± N/A  |

This verifies that we have full backward compatibility.
@LSinev @haileyschoelkopf can you review again?

@haileyschoelkopf
Contributor

I will look at this today @elronbandel , sorry for letting it sit so long!

Contributor

@haileyschoelkopf haileyschoelkopf left a comment


@elronbandel the PR looks good to me!

We're just finishing up a couple of loose ends on #1741 and will have that merged either later today or, at worst, by the weekend due to the US holiday tomorrow (the PR is ready; we're just confirming some behavior changes with task authors).

I intend to merge this PR after fixing up the merge conflicts and retesting here, as soon as that other one goes through.

@haileyschoelkopf
Contributor

Merged into main; the last test failure is just due to the tests not installing optional dependencies for the unitxt metrics. LGTM, thanks @elronbandel!

@haileyschoelkopf haileyschoelkopf merged commit ad80f55 into EleutherAI:main Jul 8, 2024
8 of 9 checks passed