Pile tasks on big-refactor use dataset_names from old dataset loader that don't exist on HF #731

Open · yeoedward opened this issue Aug 3, 2023 · 2 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Contributors and extra help welcome)

@yeoedward (Contributor) commented:
Task example: https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/pile/pile_arxiv.yaml#L7

HF dataset: https://huggingface.co/datasets/EleutherAI/pile

Original dataset loader prior to big-refactor: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/datasets/pile/pile.py

@haileyschoelkopf mentioned that using this loading script should work if we upload it to HF and point the Pile tasks to that new dataset.
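For anyone reproducing this, a quick way to see the mismatch is to list the configs that actually exist on the HF hub for EleutherAI/pile and compare them against the dataset_name values used by the big-refactor task YAMLs (a rough sketch; the config name below is illustrative, not confirmed from the YAML):

# Rough sketch of the mismatch described above; "pile_arxiv" is illustrative.
from datasets import get_dataset_config_names

hub_configs = get_dataset_config_names("EleutherAI/pile")
print(hub_configs)  # the old loader's subset names do not appear here,
                    # which is why e.g.
                    # load_dataset("EleutherAI/pile", "pile_arxiv") fails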

@StellaAthena added the bug, help wanted, and good first issue labels on Aug 8, 2023
@pratyushmaini commented Aug 11, 2023:

Adding the file "pile.py" at "lm-evaluation-harness/EleutherAI/the_pile/the_pile.py" does indeed fix the issue, together with changing the test split to "test" in pile_arxiv.yaml (line 9).
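If it helps, here is a minimal sanity check (assuming the local script path above and a `datasets` version that still supports loading scripts) that the relocated loader resolves a subset before running the harness; the config name should match whatever dataset_name the task YAML passes:

# Minimal sketch; the script path and config name are assumptions.
from datasets import load_dataset

ds = load_dataset(
    "EleutherAI/the_pile/the_pile.py",  # local loading script path (assumption)
    "pile_arxiv",                       # illustrative config name
    split="test",
)
print(len(ds), ds[0]["text"][:200])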

This recipe works pretty fast, but I observe a strange trend: the first few samples are processed slowly (which is understandable), the middle samples are processed extremely quickly, and then the last few samples again take a long time. When using "accelerate launch" this almost halts forever (I eventually killed the process after waiting a few minutes), whereas running on a single GPU does let me get the final output.

@pratyushmaini commented:
Just an update to the above. Since the Pile is no longer public, you may want to modify _URLS to point to your local copy of the Pile. This is line 44 of the current pile.py:

_URLS = {
    "validation": "/data/the_pile/val.jsonl.zst",
    "test": "/data/the_pile/test.jsonl.zst",
}
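For reference, a loading script consumes these local shards roughly like this (a sketch assuming the zstandard package is installed and the usual Pile jsonl layout with "text" and "meta" fields; none of this is harness code):

# Sketch of reading one record from a local .jsonl.zst shard.
import io
import json
import zstandard as zstd

path = "/data/the_pile/val.jsonl.zst"  # local path from _URLS (assumption)
with open(path, "rb") as fh:
    stream = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        record = json.loads(line)
        print(record.get("meta"), record["text"][:100])
        break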

Also, there have been some changes to the repo since the last comment; the file should now be placed at "lm-evaluation-harness/EleutherAI/pile/pile.py".
