Difference between LFS and HuggingFace datasets? #112
Ahh, I meant to follow up to amend that README discussion: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset. However, those datasets should be in the same order as the data used with GPT-NeoX. I could see about there being another way for us to distribute the data, so that you don't need to keep two copies on disk.
In order to regenerate the ordering of the examples used during training, the README suggests downloading the dataset from LFS. I'm having issues with this process because LFS downloads two copies of the data, and I can fit only one, not two, on my hard drive. However, according to the discussion here, it seems like the datasets on HuggingFace also preserve the training order. If I just want to see the ordering of the samples, is there any reason not to just use the HuggingFace data?
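If the HuggingFace copy does preserve the training order, a minimal sketch of inspecting it without a second on-disk copy might look like the following. This assumes the rows of `EleutherAI/the_pile_deduplicated` are already in training order and that samples were consumed sequentially in fixed-size batches; the helper `step_to_indices` is a hypothetical mapping for illustration, not something from the repo.

```python
from itertools import islice

def step_to_indices(step, batch_size):
    """Hypothetical mapping: if samples are read sequentially, training
    step `step` consumes rows [step*batch_size, (step+1)*batch_size)."""
    start = step * batch_size
    return range(start, start + batch_size)

if __name__ == "__main__":
    # Third-party dependency (pip install datasets); streaming=True reads
    # shards lazily, so nothing beyond a small buffer touches the disk.
    from datasets import load_dataset
    ds = load_dataset("EleutherAI/the_pile_deduplicated",
                      split="train", streaming=True)
    # Peek at the first few samples in (assumed) training order.
    for row in islice(ds, 3):
        print(row["text"][:80])
```

With `streaming=True` the dataset is iterated shard-by-shard over the network, so this avoids the two-copies problem entirely, at the cost of sequential-only access.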
Thanks!