Pile tasks on big-refactor use dataset_names from old dataset loader that don't exist on HF #731

Open · yeoedward opened this issue Aug 3, 2023 · 2 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Contributors and extra help welcome)

@yeoedward (Contributor) commented:
Task example: https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/pile/pile_arxiv.yaml#L7

HF dataset: https://huggingface.co/datasets/EleutherAI/pile

Original dataset loader prior to big-refactor: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/datasets/pile/pile.py

@haileyschoelkopf mentioned that using this loading script should work if we upload it to HF and point the Pile tasks to that new dataset.
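For anyone reproducing this, a quick way to see the mismatch is to list the configs that actually exist on the HF hub for EleutherAI/pile and compare them against the dataset_name values used by the big-refactor task YAMLs (a rough sketch; the config name below is illustrative, not confirmed from the YAML):

# Rough sketch of the mismatch described above; "pile_arxiv" is illustrative.
from datasets import get_dataset_config_names

hub_configs = get_dataset_config_names("EleutherAI/pile")
print(hub_configs)  # the old loader's subset names do not appear here,
                    # which is why e.g.
                    # load_dataset("EleutherAI/pile", "pile_arxiv") fails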

@StellaAthena added the bug, help wanted, and good first issue labels on Aug 8, 2023
@pratyushmaini commented Aug 11, 2023:

Adding the file "pile.py" at "lm-evaluation-harness/EleutherAI/the_pile/the_pile.py" does indeed fix the issue, together with changing the test split to "test" in pile_arxiv.yaml (line 9).
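If it helps, here is a minimal sanity check (assuming the local script path above and a `datasets` version that still supports loading scripts) that the relocated loader resolves a subset before running the harness; the config name should match whatever dataset_name the task YAML passes:

# Minimal sketch; the script path and config name are assumptions.
from datasets import load_dataset

ds = load_dataset(
    "EleutherAI/the_pile/the_pile.py",  # local loading script path (assumption)
    "pile_arxiv",                       # illustrative config name
    split="test",
)
print(len(ds), ds[0]["text"][:200])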

This recipe works pretty fast, but I observe a strange trend: the first few samples are processed slowly (which is understandable), the middle samples are processed extremely quickly, and then the last few samples again take a long time. When using "accelerate launch" this almost halts forever (I eventually killed the process after waiting a few minutes), whereas running on a single GPU does let me get the final output.

@pratyushmaini commented:
Just an update to the above. Since the Pile is no longer public, you may want to modify _URLS to point to your local copy of the Pile. This is line 44 of the current pile.py:

_URLS = {
    "validation": "/data/the_pile/val.jsonl.zst",
    "test": "/data/the_pile/test.jsonl.zst",
}
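For reference, a loading script consumes these local shards roughly like this (a sketch assuming the zstandard package is installed and the usual Pile jsonl layout with "text" and "meta" fields; none of this is harness code):

# Sketch of reading one record from a local .jsonl.zst shard.
import io
import json
import zstandard as zstd

path = "/data/the_pile/val.jsonl.zst"  # local path from _URLS (assumption)
with open(path, "rb") as fh:
    stream = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        record = json.loads(line)
        print(record.get("meta"), record["text"][:100])
        break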

Also, there have been some changes to the repo since the last comment; the file should now be placed at "lm-evaluation-harness/EleutherAI/pile/pile.py".
