diff --git a/README.md b/README.md
index 587f8ef..7ee5b2e 100644
--- a/README.md
+++ b/README.md
@@ -104,6 +104,9 @@
 git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps
 python utils/checksum_shards.py
 python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/
+
+# The correct sha256 for the full file is 0cd548efd15974d5cca78f9baddbd59220ca675535dcfc0c350087c79f504693
+# This can be checked with sha256sum ./pythia_pile_idxmaps/*
 ```
 
 This will take over a day to run, though it should not require more than 5 GB of RAM. We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models. In addition to the training data, you will need to make a local copy of the tokenizer we used to train our models. You can find it [here](https://github.com/EleutherAI/pythia/blob/main/utils/20B_tokenizer.json).
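
The added comments suggest verifying with `sha256sum ./pythia_pile_idxmaps/*`. If you would rather check from Python, a minimal sketch is below; it streams the file in chunks so memory use stays constant, in line with the ~5 GB RAM budget noted above. The unsharded output filename is an assumption here, not something the diff specifies, so adjust it to whatever `unshard_memmap.py` actually writes.

```python
import hashlib

# Expected digest for the full unsharded file, from the README comment above
EXPECTED = "0cd548efd15974d5cca78f9baddbd59220ca675535dcfc0c350087c79f504693"

# Assumed output path; confirm the actual filename in ./pythia_pile_idxmaps/
PATH = "./pythia_pile_idxmaps/pile_0.87_deduped_text_document.bin"

h = hashlib.sha256()
with open(PATH, "rb") as f:
    # Read 64 MiB at a time so the multi-hundred-GB file is never held in RAM at once
    for chunk in iter(lambda: f.read(64 * 1024 * 1024), b""):
        h.update(chunk)

digest = h.hexdigest()
print("OK" if digest == EXPECTED else f"MISMATCH: {digest}")
```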