What tool do you use for your data preprocessing/binarization? #69
Hi! We use https://github.com/EleutherAI/gpt-neox/blob/main/prepare_data.py to preprocess our data. In particular, for the Pile, we would run … In theory, this should be deterministic, but in practice, if you would like to fully replicate our dataset + exact shuffling setup, we recommend using the provided files to be on the safe side.
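The exact command above was lost from this page. As a hedged illustration of what a `prepare_data.py`-style binarization step produces, here is a minimal, self-contained sketch of the `.bin`/`.idx` idea: token ids packed into a flat binary file plus a companion offsets file. The layout below is an assumption for illustration only, not the actual Megatron/GPT-NeoX on-disk format.

```python
# Hypothetical sketch of .bin/.idx-style binarization -- NOT the exact
# Megatron/GPT-NeoX format. Token ids go into a flat .bin file; per-document
# offsets go into a companion .idx file so documents can be memmapped back.
import numpy as np

def binarize(docs, bin_path, idx_path):
    """docs: list of lists of token ids (stand-in for tokenizer output)."""
    offsets = [0]
    with open(bin_path, "wb") as f:
        for doc in docs:
            np.asarray(doc, dtype=np.uint16).tofile(f)
            offsets.append(offsets[-1] + len(doc))
    np.asarray(offsets, dtype=np.int64).tofile(idx_path)

def read_doc(bin_path, idx_path, i):
    """Read back document i without loading the whole .bin into memory."""
    offsets = np.fromfile(idx_path, dtype=np.int64)
    tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")
    return tokens[offsets[i]:offsets[i + 1]].tolist()

docs = [[5, 6, 7], [8, 9], [10]]
binarize(docs, "data.bin", "data.idx")
assert read_doc("data.bin", "data.idx", 1) == [8, 9]
```

The memmap read is why training can stream arbitrary documents from a huge pretokenized file without deserializing it.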
In #76 you said that you used … I'm specifically asking about the tool used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps.
Oh, I see now that …
Yep, sorry for the lack of clarity on my part! We used …
Thank you! One more question: it doesn't look like that script has an option for generating the deduplicated Pile dataset. It's easy to imagine how to extend it to pull in the deduplicated dataset, but I wondered if you happened to have the code used to generate https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps? Always prefer not to guess when doing reproductions :)
Totally fair! Have you tried doing … as described in the README of this repository? I'd recommend this as the most surefire way to get exactly the same file I've got. If you've tried this and it doesn't work, or it is for some reason not a viable option for you, I can go back and confirm I get the same result when running …
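The README step referenced above involves downloading the sharded pretokenized dataset and merging the shards back into one file (the repository ships its own unsharding utility for this; its name and invocation are not shown on this page). As a hedged sketch of the merging idea only, assuming shards are raw contiguous chunks of one file with zero-padded names:

```python
# Hypothetical sketch: merging sharded binary files by byte concatenation.
# The actual repository provides its own unsharding script; this only
# illustrates the idea for shards that are plain contiguous chunks.
import glob
import shutil

def merge_shards(pattern, out_path):
    shards = sorted(glob.glob(pattern))  # relies on zero-padded shard names
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as f:
                shutil.copyfileobj(f, out)  # stream bytes, no full load
    return shards

# Toy demo with two fake shards.
for i, chunk in enumerate([b"abc", b"def"]):
    with open(f"shard-{i:02d}.bin", "wb") as f:
        f.write(chunk)

merged = merge_shards("shard-*.bin", "merged.bin")
assert len(merged) == 2
```

Byte-for-byte concatenation in shard order is what makes the reconstructed file identical to the original, which is the "surefire way to get exactly the same file" point above.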
Following up from a previously closed thread: do the 2nd and 3rd links below contain the same examples as the ones used for Pythia training, and in the same order? https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps (tokenized + sharded)
Yes
This is actually not the case: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset. See #112 for more detail.
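The distinction above (same examples, different order) is easy to trip over in reproductions. A toy sketch of the situation, using a hypothetical seeded shuffle as a stand-in for whatever shuffle the pretokenized files received:

```python
# Toy illustration: two copies of the "same" dataset can contain identical
# examples yet disagree on order. The seed here is made up for the demo,
# not the one actually used for Pythia.
import random

docs = [f"doc-{i}" for i in range(100)]

pretokenized_order = docs[:]               # stand-in for the LFS idxmaps order
random.Random(1234).shuffle(pretokenized_order)  # hypothetical shuffle + seed
hf_order = docs[:]                         # stand-in for the HF dataset order

assert sorted(hf_order) == sorted(pretokenized_order)  # same examples...
assert hf_order != pretokenized_order                  # ...different order
```

This is why checking only for matching contents is not enough when exact training order matters, e.g. for reproducing which examples a checkpoint had seen by a given step.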
Hi, I am trying to train a GPT model from scratch using your training script. However, you have only provided your preprocessed data without the preprocessing script. Would it be possible to share the preprocessing scripts to generate the .bin and .idx files?