Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md #134

Merged
merged 1 commit into from
Nov 15, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,9 @@ git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idx
python utils/checksum_shards.py

python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/

# The correct sha256 for the full file is 0cd548efd15974d5cca78f9baddbd59220ca675535dcfc0c350087c79f504693
# This can be checked with sha256sum ./pythia_pile_idxmaps/*
```
This will take over a day to run, though it should not require more than 5 GB of RAM. We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models. In addition to the training data, you will need to make a local copy of the tokenizer we used to train our models. You can find it [here](https://github.com/EleutherAI/pythia/blob/main/utils/20B_tokenizer.json).

Expand Down