What tool do you use for your data preprocessing/binarization? #69

Closed
ajesujoba opened this issue Feb 22, 2023 · 9 comments

@ajesujoba

ajesujoba commented Feb 22, 2023

Hi, I am trying to train a GPT model from scratch using your training script. However, you have only provided your preprocessed data without the preprocessing script. Would it be possible to share the preprocessing scripts to generate the .bin and .idx files?

@ajesujoba ajesujoba changed the title What tool do you use for your data preprocessing? What tool do you use for your data preprocessing/binarization? Feb 22, 2023
@haileyschoelkopf
Collaborator

haileyschoelkopf commented Feb 24, 2023

Hi! We use https://github.com/EleutherAI/gpt-neox/blob/main/prepare_data.py to preprocess our data.

In particular, for the Pile, we would run python prepare_data.py pile -d /path/to/data -t HFTokenizer -v 20B_tokenizer.json, where 20B_tokenizer.json is the file in this repo found here: https://github.com/EleutherAI/pythia/blob/main/utils/20B_tokenizer.json.

In theory, this should be deterministic, but in practice, if you would like to fully replicate our dataset and exact shuffling setup, we recommend using the provided files to be on the safe side.
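
If you do rerun the preprocessing yourself, one simple way to check whether your regenerated .bin/.idx pair matches the provided files bit-for-bit is to compare checksums. A minimal sketch using only the standard library (the file names below are placeholders, not the actual output names):

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB .bin files don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder paths: point these at your regenerated output and the provided files.
pairs = [
    ("my_output/pile_text_document.bin", "provided/pile_text_document.bin"),
    ("my_output/pile_text_document.idx", "provided/pile_text_document.idx"),
]

for mine, theirs in pairs:
    match = sha256(Path(mine)) == sha256(Path(theirs))
    print(f"{mine} vs {theirs}: {'match' if match else 'MISMATCH'}")
```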

@zplizzi

zplizzi commented Mar 23, 2023

In #76 you said that you used preprocess_data.py to generate the dataset, but here you say you use prepare_data.py.

I'm specifically asking about the tool used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps. If it was preprocess_data.py, it would be useful to know the exact args you used (e.g., I see flags for the EOD token, ftfy, etc. that could affect the results).

@zplizzi

zplizzi commented Mar 23, 2023

Oh, I see now that prepare_data.py calls preprocess_data.py.

@haileyschoelkopf
Collaborator

Yep, sorry for the lack of clarity on my part! We used prepare_data.py, and the command I provided above should be the one used, with the flags passed to preprocess_data.py given by its defaults (none are overridden for the Pile in prepare_data.py). So EOD should be added after each doc, and ftfy was not used. I would still recommend the resharding script provided in the utils folder to be sure you get the right .bin and .idx files: running prepare_data.py was not performed by me, so I can't tell you an exact commit id. However, neither prepare_data.py, tools/corpora.py, nor tools/preprocess_data.py has been changed in relevant ways over the past year, so the exact commit should not affect this.
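
To make those defaults concrete, here is a rough per-document sketch of what they amount to: encode each document with the 20B tokenizer and append the EOD token, with no ftfy normalization. This is an illustration only, not the actual tools/preprocess_data.py code; it loads the tokenizer directly with the huggingface tokenizers library rather than NeoX's HFTokenizer wrapper, and it assumes <|endoftext|> is the EOD token defined in 20B_tokenizer.json:

```python
from tokenizers import Tokenizer

# Illustration of the defaults described above (append EOD, no ftfy),
# not the actual tools/preprocess_data.py implementation.
tok = Tokenizer.from_file("20B_tokenizer.json")

# Assumption: <|endoftext|> is the EOD token defined in 20B_tokenizer.json.
eod_id = tok.token_to_id("<|endoftext|>")

def tokenize_document(text: str) -> list[int]:
    ids = tok.encode(text).ids  # no ftfy normalization applied
    ids.append(eod_id)          # EOD appended after each document
    return ids

print(tokenize_document("Hello world")[-1] == eod_id)  # expected: True
```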

@zplizzi

zplizzi commented Mar 26, 2023

Thank you! One more question: it doesn't look like that script has an option for generating the deduplicated Pile dataset. It's easy to imagine how to extend it to pull in the deduplicated dataset, but I wondered if you happened to have the code used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps? I always prefer not to guess when doing reproductions :)

@haileyschoelkopf
Collaborator

Totally fair! Have you tried doing

git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_pile_idxmaps

python utils/unshard_memmap.py --input_file ./pythia_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/

as described in the README of this repository? I'd recommend this as the most surefire way to get exactly the same file I've got.

If you've tried this and it doesn't work, or it is for some reason not a viable option for you, I can go back and confirm that I get the same result when running prepare_data.py on the JSONL files from https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main, to see if that will also work. These raw files are the same ones used, but I can't guarantee an exact match on the ordering, because the deduped .bin and .idx files were generated by someone else before I worked on this project, so the files may have been fed in in a different order.
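
If you go the git lfs + unshard_memmap.py route, a coarse sanity check after reassembly is that the merged .bin is exactly the sum of the shard sizes. This sketch assumes the shards are reassembled by plain byte-wise concatenation of the .bin payloads (check utils/unshard_memmap.py if in doubt), and the directory and merged file name below are placeholders:

```python
from pathlib import Path

# Assumption: unshard_memmap.py concatenates the shard .bin payloads byte-for-byte,
# so the merged file should be exactly the sum of the shard sizes.
shard_dir = Path("./pythia_pile_idxmaps")                   # placeholder path
shards = sorted(shard_dir.glob("pile_0.87_deduped_text_document-*-of-*.bin"))
merged = shard_dir / "pile_0.87_deduped_text_document.bin"  # placeholder merged name

shard_total = sum(p.stat().st_size for p in shards)
print(f"{len(shards)} shards, {shard_total} bytes total")
print(f"merged file: {merged.stat().st_size} bytes")
print("sizes match:", merged.stat().st_size == shard_total)
```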

@stabilize-ai

Following up from a previously closed thread: do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order?

https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps (tokenized + sharded)
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated (jsonl)
https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile (raw, parquet)

@StellaAthena
Member

Yes

@haileyschoelkopf
Collaborator

This is actually not the case: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset. See #112 for more detail.
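
For anyone wanting to see the order mismatch directly, a rough sketch: decode the first entry of the unsharded pretokenized dataset and compare it against the first document streamed from the HF dataset. This assumes the memory-mapped dataset reader lives at utils/mmap_dataset.py in this repo (a copy of GPT-NeoX's MMapIndexedDataset; the exact module path and constructor arguments may differ by commit), that the datasets and tokenizers packages are installed, and that the file prefix below matches your local unsharded output:

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from utils.mmap_dataset import MMapIndexedDataset  # assumed location of the memmap reader

# Path prefix of the unsharded .bin/.idx pair (placeholder; adjust to your local copy).
idxmaps = MMapIndexedDataset("./pythia_pile_idxmaps/pile_0.87_deduped_text_document", skip_warmup=True)
tok = Tokenizer.from_file("utils/20B_tokenizer.json")
first_pretokenized = tok.decode([int(t) for t in idxmaps[0]])

# First document of the HF-hosted deduplicated Pile, streamed to avoid a full download.
hf = load_dataset("EleutherAI/the_pile_deduplicated", split="train", streaming=True)
first_hf = next(iter(hf))["text"]

# If the shuffle orders matched, these would start with the same text; per #112 they do not.
print(first_pretokenized[:200])
print(first_hf[:200])
```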
