
Host Pile pretokenized .bin and .idx megatron files? #15

Closed
haileyschoelkopf opened this issue Nov 13, 2022 · 7 comments · Fixed by #45

Comments

@haileyschoelkopf (Collaborator) commented Nov 13, 2022

It might be worth also hosting the Pile .bin and .idx files, for people to more easily reproduce our training runs on the same data if they desire. I don't think the deduplicated Pile has been hosted anywhere before.

@haileyschoelkopf haileyschoelkopf changed the title Host Pile pretokenized .bin and .idx megatron files Host Pile pretokenized .bin and .idx megatron files? Nov 13, 2022
@haileyschoelkopf (Collaborator, Author)

It shouldn't be a huge issue if we don't host them, as the tokenization is deterministic.

@StellaAthena (Member)

I think we should do this, to make people’s lives easier. Yes, the tokenization is deterministic, but it’s a pain to run locally. That’s especially true for people with fewer resources than us.

@haileyschoelkopf (Collaborator, Author)

Completely agree! I guess we can probably host them on the hub?

@StellaAthena (Member)

Yup, I think that makes the most sense. We could do it in its own file, or we could add it to the existing Pile file. I like the idea of the latter more, but it may be worth opening an issue to ask? We should definitely upload it to the EAI account as a standalone first though, to make it accessible as quickly as possible.

@haileyschoelkopf (Collaborator, Author)

Makes sense! I tried uploading them standalone to the Hub just yesterday, but the upload (using the Python HF Hub API) timed out because of the large file size, so we may need to talk with HF about how to get a single large file hosted. (We could shard, but if possible I'd prefer to avoid potentially messing up the file.)

The JSONLs are uploaded, though, here: https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile. I'll add some documentation soon.

@haileyschoelkopf (Collaborator, Author)

@StellaAthena HF has a file size cap of ~50 GB, but the .bin files are ~400 GB for the deduped Pile and probably 1.5x that for the non-deduped Pile. Is there any chance they'd make an exception, or do we need to host these elsewhere or shard the .bin file somehow? (Not a huge fan of sharding it.)
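[Editor's note: for context, splitting a large file into shards under the ~50 GB cap can be sketched with GNU coreutils. The file name `pile_dedupe.bin` and the 45G shard size are illustrative assumptions, not the actual artifact names or the approach the maintainers settled on.]

```shell
# Split the large .bin into numbered shards under the ~50 GB cap.
# -d     : numeric suffixes (part00, part01, ...), so lexical sort = byte order
# -a 2   : two-digit suffixes (enough for ~400 GB / 45 GB ≈ 9 shards)
split -b 45G -d -a 2 pile_dedupe.bin pile_dedupe.bin.part

# Record a checksum of the original so downloaders can verify reconstruction.
sha256sum pile_dedupe.bin > pile_dedupe.bin.sha256
```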

@StellaAthena (Member)

I think sharding it is fine, so long as we write a script that reconstructs the shards and test that it does so exactly. We can also provide a checksum so people can verify.
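[Editor's note: the reconstruct-and-verify script described above could look roughly like this. A minimal sketch only: the `.partNN` naming convention (as GNU `split -d` produces) and the helper names `reconstruct`/`verify` are assumptions for illustration, not the repository's actual tooling.]

```python
import hashlib
from pathlib import Path


def reconstruct(shard_dir: str, stem: str, out_path: str,
                chunk: int = 1 << 20) -> str:
    """Concatenate shards named `<stem>.partNN` (in numeric order) into
    `out_path`, returning the sha256 hex digest of the reconstructed file."""
    shards = sorted(Path(shard_dir).glob(stem + ".part*"))
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as f:
                # Stream in chunks so a ~400 GB file never sits in memory.
                while block := f.read(chunk):
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()


def verify(reconstructed_digest: str, expected_digest: str) -> bool:
    """Compare against the published checksum of the original file."""
    return reconstructed_digest == expected_digest
```

Publishing the original file's sha256 alongside the shards lets downloaders confirm the reconstruction is byte-for-byte exact, which addresses the "messing up the file" concern above.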
