Host Pile pretokenized .bin and .idx megatron files? #15
Comments
It shouldn't be a huge issue if we don't host them, as the tokenization is deterministic.

I think we should do this, to make people's lives easier. Yes, the tokenization is deterministic, but it's a pain to run locally. That's especially true for people with fewer resources than us.

Completely agree! I guess we can probably host them on the hub?

Yup, I think that makes the most sense. We could do it in its own file, or we could add it to the existing Pile file. I like the idea of the latter more, but it may be worth opening an issue to ask? We should definitely upload it to the EAI account as a standalone first though, to make it accessible as quickly as possible.
Makes sense! I tried uploading them standalone to the hub just yesterday, but the upload (using the Python HF hub API) timed out because of the large file size, so we may need to talk with HF about how to get a single large file hosted. (We could shard, but if possible I'd prefer to avoid potentially messing up the file.) The JSONLs are uploaded though, here: https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile I'll add some documentation soon.
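If we do end up sharding, uploading each shard as its own file keeps every transfer well under the cap and sidesteps the single-file timeout. A minimal sketch of that, assuming `huggingface_hub`'s `HfApi.upload_file` (the repo id below is a placeholder, not a real repo):

```python
def upload_shards(shards, repo_id, api=None):
    """Upload each shard as a separate file in a dataset repo on the Hub.

    Per-shard uploads keep each transfer small, avoiding the timeout
    we hit when pushing one huge file through the API.
    """
    if api is None:
        # Imported lazily so the function can be exercised with a stub api.
        from huggingface_hub import HfApi
        api = HfApi()
    for shard_path in shards:
        api.upload_file(
            path_or_fileobj=shard_path,
            # Keep only the filename, not the local directory prefix.
            path_in_repo=shard_path.rsplit("/", 1)[-1],
            repo_id=repo_id,  # e.g. "EleutherAI/some-dataset" (placeholder)
            repo_type="dataset",
        )
```

Each `upload_file` call is independent, so a failed shard can be retried without re-uploading the rest.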
@StellaAthena HF has a file size cap of ~50 GB, but the
I think sharding it is fine, so long as we write a script that reconstructs the shards and test that it does so exactly. We can also provide a checksum so people can verify.
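The shard/reconstruct/verify workflow could look something like the sketch below. This is only an illustration, not the actual script: the `.partNNN` naming scheme, the shard size, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def shard_file(path, shard_size):
    """Split `path` into `path.partNNN` files of at most `shard_size` bytes."""
    shards = []
    with open(path, "rb") as src:
        idx = 0
        while True:
            chunk = src.read(shard_size)
            if not chunk:
                break
            shard_path = f"{path}.part{idx:03d}"
            with open(shard_path, "wb") as dst:
                dst.write(chunk)
            shards.append(shard_path)
            idx += 1
    return shards


def reconstruct(shards, out_path):
    """Concatenate the shards, in order, back into a single file."""
    with open(out_path, "wb") as dst:
        for shard_path in shards:
            with open(shard_path, "rb") as src:
                dst.write(src.read())


def sha256sum(path):
    """Checksum so users can verify the reconstructed file byte-for-byte."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()
```

Publishing the original file's `sha256sum` alongside the shards lets anyone confirm `reconstruct` produced an exact copy.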
It might be worth also hosting the Pile `.bin` and `.idx` files, for people to more easily reproduce our training runs on the same data if they desire. I don't think the deduplicated Pile has been hosted anywhere before.