What tool do you use for your data preprocessing/binarization? #69

Closed
ajesujoba opened this issue Feb 22, 2023 · 9 comments

@ajesujoba

ajesujoba commented Feb 22, 2023

Hi, I am trying to train a GPT model from scratch using your training script. However, you have only provided your preprocessed data without the preprocessing script. Would it be possible to share the preprocessing scripts to generate the .bin and .idx files?

@ajesujoba ajesujoba changed the title What tool do you use for your data preprocessing? What tool do you use for your data preprocessing/binarization? Feb 22, 2023
@haileyschoelkopf
Collaborator

haileyschoelkopf commented Feb 24, 2023

Hi! We use https://github.com/EleutherAI/gpt-neox/blob/main/prepare_data.py to preprocess our data.

In particular, for the Pile, we would run python prepare_data.py pile -d /path/to/data -t HFTokenizer -v 20B_tokenizer.json, where 20B_tokenizer.json is the file in this repo found here: https://github.com/EleutherAI/pythia/blob/main/utils/20B_tokenizer.json.

In theory, this should be deterministic, but in practice, if you would like to fully replicate our dataset and exact shuffling setup, we recommend using the provided files to be on the safe side.
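
If you do rerun the preprocessing yourself, one simple way to check whether your regenerated .bin/.idx pair matches the provided files bit-for-bit is to compare checksums. A minimal sketch using only the standard library (the file names below are placeholders, not the actual output names):

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB .bin files don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder paths: point these at your regenerated output and the provided files.
pairs = [
    ("my_output/pile_text_document.bin", "provided/pile_text_document.bin"),
    ("my_output/pile_text_document.idx", "provided/pile_text_document.idx"),
]

for mine, theirs in pairs:
    match = sha256(Path(mine)) == sha256(Path(theirs))
    print(f"{mine} vs {theirs}: {'match' if match else 'MISMATCH'}")
```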

@zplizzi

zplizzi commented Mar 23, 2023

In #76 you said that you used preprocess_data.py to generate the dataset, but here you say you use prepare_data.py.

I'm specifically asking about the tool used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps. If it was preprocess_data.py, it would be useful to know the exact args you used (e.g., I see flags for the EOD token, ftfy, etc. that could affect the results).

@zplizzi

zplizzi commented Mar 23, 2023

Oh, I see now that prepare_data.py calls preprocess_data.py.

@haileyschoelkopf
Collaborator

Yep, sorry for the lack of clarity on my part! We used prepare_data.py, and the command I provided above should be the one used, with the flags passed to preprocess_data.py given by its defaults (none are overridden for the Pile in prepare_data.py). So EOD should be added after each doc, and ftfy was not used. I would still recommend the resharding script provided in the utils folder to be sure you get the right .bin and .idx files: running prepare_data.py was not performed by me, so I can't tell you an exact commit id. However, neither prepare_data.py, tools/corpora.py, nor tools/preprocess_data.py has been changed in relevant ways over the past year, so the exact commit should not affect this.
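
To make those defaults concrete, here is a rough per-document sketch of what they amount to: encode each document with the 20B tokenizer and append the EOD token, with no ftfy normalization. This is an illustration only, not the actual tools/preprocess_data.py code; it loads the tokenizer directly with the huggingface tokenizers library rather than NeoX's HFTokenizer wrapper, and it assumes <|endoftext|> is the EOD token defined in 20B_tokenizer.json:

```python
from tokenizers import Tokenizer

# Illustration of the defaults described above (append EOD, no ftfy),
# not the actual tools/preprocess_data.py implementation.
tok = Tokenizer.from_file("20B_tokenizer.json")

# Assumption: <|endoftext|> is the EOD token defined in 20B_tokenizer.json.
eod_id = tok.token_to_id("<|endoftext|>")

def tokenize_document(text: str) -> list[int]:
    ids = tok.encode(text).ids  # no ftfy normalization applied
    ids.append(eod_id)          # EOD appended after each document
    return ids

print(tokenize_document("Hello world")[-1] == eod_id)  # expected: True
```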

@zplizzi

zplizzi commented Mar 26, 2023

Thank you! One more question: it doesn't look like that script has an option for generating the deduplicated Pile dataset. It's easy to imagine how to extend it to pull in the deduplicated dataset, but I wondered if you happened to have the code used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps? I always prefer not to guess when doing reproductions :)

@haileyschoelkopf
Collaborator

Totally fair! Have you tried doing

git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_pile_idxmaps

python utils/unshard_memmap.py --input_file ./pythia_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/

as described in the README of this repository? I'd recommend this as the most surefire way to get exactly the same file I've got.

If you've tried this and it doesn't work, or it is for some reason not a viable option for you, I can go back and confirm that I get the same result when running prepare_data.py on the JSONL files from https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main, to see if that will also work. These raw files are the same ones used, but I can't guarantee an exact match on the ordering, because the deduped .bin and .idx files were generated by someone else before I worked on this project, so the files may have been fed in in a different order.
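
If you go the git lfs + unshard_memmap.py route, a coarse sanity check after reassembly is that the merged .bin is exactly the sum of the shard sizes. This sketch assumes the shards are reassembled by plain byte-wise concatenation of the .bin payloads (check utils/unshard_memmap.py if in doubt), and the directory and merged file name below are placeholders:

```python
from pathlib import Path

# Assumption: unshard_memmap.py concatenates the shard .bin payloads byte-for-byte,
# so the merged file should be exactly the sum of the shard sizes.
shard_dir = Path("./pythia_pile_idxmaps")                   # placeholder path
shards = sorted(shard_dir.glob("pile_0.87_deduped_text_document-*-of-*.bin"))
merged = shard_dir / "pile_0.87_deduped_text_document.bin"  # placeholder merged name

shard_total = sum(p.stat().st_size for p in shards)
print(f"{len(shards)} shards, {shard_total} bytes total")
print(f"merged file: {merged.stat().st_size} bytes")
print("sizes match:", merged.stat().st_size == shard_total)
```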

@stabilize-ai

Following up from a previously closed thread: do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order?

https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps (tokenized + sharded)
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated (jsonl)
https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile (raw, parquet)

@StellaAthena
Member

Yes

@haileyschoelkopf
Collaborator

This is actually not the case: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset. See #112 for more detail.
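
For anyone wanting to see the order mismatch directly, a rough sketch: decode the first entry of the unsharded pretokenized dataset and compare it against the first document streamed from the HF dataset. This assumes the memory-mapped dataset reader lives at utils/mmap_dataset.py in this repo (a copy of GPT-NeoX's MMapIndexedDataset; the exact module path and constructor arguments may differ by commit), that the datasets and tokenizers packages are installed, and that the file prefix below matches your local unsharded output:

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from utils.mmap_dataset import MMapIndexedDataset  # assumed location of the memmap reader

# Path prefix of the unsharded .bin/.idx pair (placeholder; adjust to your local copy).
idxmaps = MMapIndexedDataset("./pythia_pile_idxmaps/pile_0.87_deduped_text_document", skip_warmup=True)
tok = Tokenizer.from_file("utils/20B_tokenizer.json")
first_pretokenized = tok.decode([int(t) for t in idxmaps[0]])

# First document of the HF-hosted deduplicated Pile, streamed to avoid a full download.
hf = load_dataset("EleutherAI/the_pile_deduplicated", split="train", streaming=True)
first_hf = next(iter(hf))["text"]

# If the shuffle orders matched, these would start with the same text; per #112 they do not.
print(first_pretokenized[:200])
print(first_hf[:200])
```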
