Host Deduped Pile raw jsonls #27

Closed
haileyschoelkopf opened this issue Nov 29, 2022 · 5 comments

Comments

@haileyschoelkopf
Collaborator

haileyschoelkopf commented Nov 29, 2022

I don't think the deduped Pile raw text data is hosted anywhere; I couldn't find it on the Eye. Even if we don't host the deduped Pile .bin and .idx files somewhere, we definitely need to host the raw deduped Pile data to make these experiments replicable.

@aflah02
Contributor

aflah02 commented Dec 17, 2022

@haileyschoelkopf Thanks for the great work on this repo! Is there an ETA for hosting the Pile dataset? It would be really helpful for some analysis.

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Dec 17, 2022

Hi, thanks for the interest!

I've uploaded the deduplicated Pile in Parquet format on HF datasets, which may be helpful:
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated

I'll try to upload the JSONLs today too. I'm running into file-size issues with our .bin and .idx files from NeoX.

Please let me know if you have any questions about replicating data ordering in training. We'll have something up on replicating the data seen by the model very soon!
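
For anyone who wants to peek at the Parquet upload without downloading all of it, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The split name `train` and the `text` column are assumptions about how the dataset is laid out, and `take_texts` and `stream_deduped_pile` are hypothetical helper names, not part of this repo.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take_texts(records: Iterable[dict], n: int, field: str = "text") -> List[str]:
    """Return the `field` value of the first `n` records."""
    return [record[field] for record in islice(records, n)]


def stream_deduped_pile() -> Iterator[dict]:
    """Lazily iterate over the dataset instead of downloading it in full."""
    # Imported here so take_texts() works even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True fetches data on demand; split="train" is an assumption.
    dataset = load_dataset(
        "EleutherAI/the_pile_deduplicated", split="train", streaming=True
    )
    return iter(dataset)
```

Calling `take_texts(stream_deduped_pile(), 3)` should then pull only the first few documents over the network, which is usually enough for a quick sanity check.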

@aflah02
Contributor

aflah02 commented Dec 18, 2022

Thanks a ton, I'll reach out if I face some issues!

@haileyschoelkopf
Collaborator Author

https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile BTW, all the deduped Pile jsonl files are here now :)
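
Once a JSONL shard from that repository is on disk, it can be read one record per line with the standard library alone. The sketch below assumes the file has already been decompressed to plain UTF-8 text and that each record carries a `text` field; both are assumptions about the upload, and `iter_jsonl` and `count_documents` are hypothetical helper names.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


def count_documents(path: str, field: str = "text") -> int:
    """Count records whose `field` value is present and non-empty."""
    return sum(1 for record in iter_jsonl(path) if record.get(field))
```

Because `iter_jsonl` is a generator, it never holds more than one document in memory, which matters for shards of this size.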

@aflah02
Contributor

aflah02 commented Dec 18, 2022

Thanks!!
