Add Megatron .bin file sharding and unsharding scripts #45

Merged
merged 4 commits into from
Jan 4, 2023

Conversation

haileyschoelkopf
Collaborator

@haileyschoelkopf haileyschoelkopf commented Dec 27, 2022

Intended to close #15.

Two utilities to shard and unshard a .bin file. These should be usable already (tested with Enron) and shouldn't naively load whole files into memory, but I will test them on the Pile shortly.
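For illustration only, here is a minimal sketch of how a large .bin file could be split into fixed-size shards without reading it into memory at once. It is not the script from this PR: the function name `shard_file`, the `<name>.<index>` shard naming, and the chunk size are all assumptions made for the example, and it works on raw bytes using only the standard library.

```python
import os


def shard_file(path, shard_size, out_dir, chunk_size=1 << 20):
    """Split `path` into numbered shards of at most `shard_size` bytes.

    Hypothetical helper: only `chunk_size` bytes are held in memory at a time.
    Assumes shard_size > 0. Returns the list of shard paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(path)
    shard_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            # Read the first chunk before creating the shard so that no
            # empty trailing shard is written.
            first = src.read(min(chunk_size, shard_size))
            if not first:
                break
            shard_path = os.path.join(out_dir, f"{base}.{index:05d}")
            with open(shard_path, "wb") as dst:
                dst.write(first)
                written = len(first)
                while written < shard_size:
                    chunk = src.read(min(chunk_size, shard_size - written))
                    if not chunk:
                        break
                    dst.write(chunk)
                    written += len(chunk)
            shard_paths.append(shard_path)
            index += 1
    return shard_paths
```

If the shards are later read back as token arrays, `shard_size` would need to be a multiple of the token width in bytes so that no token is split across two shards.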

Checklist

  • Write scripts
  • Test with Pile data
  • Write README section on how to use (include instructions to self-tokenize as well)
  • Actually shard + upload both datasets to HF

@StellaAthena
Member

I view this as not ready for merging based on the current checklist

@haileyschoelkopf
Collaborator Author

I started trying this on the Pile, and it seems like fully reconstructing the .bin file from the shards will not be a whole lot faster than just running the tokenization oneself: it is set to take ~40 hours to reconstruct. Maybe I'm doing something stupid and inefficient, but as far as I know, resharding the memmapped numpy array requires copying all 60 of the 10GB memmapped shards into the full-size memmapped .bin array.

(For reference, C4 took 18 hours to tokenize with NeoX)
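The copy described above (memmapped shards into one full-size memmapped array) might look roughly like the sketch below. This is an assumption-laden illustration, not the merged script: `unshard` is a hypothetical name, the `uint16` token dtype is assumed, and shards are assumed to be non-empty and listed in order.

```python
import numpy as np


def unshard(shard_paths, out_path, dtype=np.uint16, chunk_items=1 << 24):
    """Copy memmapped shards, in order, into one full-size memmapped file.

    Hypothetical helper. `dtype` is an assumed token width; each shard must
    be non-empty. Returns the total number of items written.
    """
    dtype = np.dtype(dtype)
    # Opening a memmap reads nothing yet; it only maps the file.
    shards = [np.memmap(p, dtype=dtype, mode="r") for p in shard_paths]
    total = sum(s.shape[0] for s in shards)
    # mode="w+" creates (or overwrites) the output file at its final size.
    out = np.memmap(out_path, dtype=dtype, mode="w+", shape=(total,))
    offset = 0
    for shard in shards:
        # Copy in bounded slices so only one slice is resident at a time.
        for start in range(0, shard.shape[0], chunk_items):
            block = shard[start:start + chunk_items]
            out[offset:offset + block.shape[0]] = block
            offset += block.shape[0]
    out.flush()
    return total
```

Every byte still has to be read once and written once, so the run time is bounded by disk throughput; the chunking only limits memory use, it does not make the copy itself cheaper.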

@StellaAthena
Member

Tokenization can be non-deterministic (don't ask) so I have a preference for desharding even if it is roughly as time consuming as tokenizing from scratch.

@StellaAthena StellaAthena requested review from StellaAthena and removed request for siddk December 28, 2022 05:54
@haileyschoelkopf
Collaborator Author

Oh god, is it really nondeterministic on the exact same input string?

I’ll let this run then and we can merge once resharded files are checked for equivalence + once everything’s uploaded.

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Jan 1, 2023

Running diff on the sharded + unsharded .bin file against the original, the files appear to be exactly the same now.
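As a sketch of an equivalent check in Python, the two files can be compared chunk by chunk so that neither is loaded whole, and a checksum can be computed for publishing alongside an upload. The names `files_identical` and `file_sha256` are hypothetical, and this is not necessarily how the comparison above was run.

```python
import hashlib
import os


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True if both files have identical bytes (hypothetical helper)."""
    # Different sizes can never match, so skip the read entirely.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF together
                return True


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in bounded chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A published digest would let someone who downloads the shards confirm that their reconstructed file matches the original without needing a copy of the original itself.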

@haileyschoelkopf
Collaborator Author

@StellaAthena could you give this a quick review when you get the chance?

Both datasets are now being uploaded and the scripts should be ready for merging.

Documentation will come in a separate PR that I'm currently working on.

@haileyschoelkopf haileyschoelkopf merged commit f4c5bb0 into main Jan 4, 2023
lintangsutawika pushed a commit that referenced this pull request Jun 19, 2023
Add Megatron `.bin` file sharding and unsharding scripts
Development

Successfully merging this pull request may close these issues.

Host Pile pretokenized .bin and .idx megatron files?
2 participants