In this directory are all of the scripts, including now-obsolete or one-off ones, used for processing, analyzing, and ablating the Pile.
Replication scripts are listed in approximate order needed for replication.
pass2_shuffle_holdout.py
: Script for pass 2 of the shuffling. The first pass is handled in the Pile repo if `--interleave` is used. Pass 2 goes through each of the interleaved outputs and shuffles it. For more info on why this works, see https://blog.janestreet.com/how-to-shuffle-a-big-dataset/. This step also creates the holdout set, from which val and test are created.

dedupe_train.py
: This script removes all exact-match data in the held-out sets (including test and val) from the training set. This is very important because otherwise there's leakage between train and val/test. Fuzzy matching is out of scope for this script.
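The exact-match removal boils down to something like the sketch below. The function names and in-memory lists are illustrative stand-ins; the real script streams over lmd archives rather than holding everything in memory.

```python
import hashlib

def sha256_text(doc: str) -> str:
    # Hash each document so the held-out set fits in memory as fixed-size digests.
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

def dedupe_train(train_docs, holdout_docs):
    """Drop any training document whose exact text appears in the held-out sets."""
    holdout_hashes = {sha256_text(d) for d in holdout_docs}
    return [d for d in train_docs if sha256_text(d) not in holdout_hashes]
```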
lang_len_analysis_pass1.py
: Runs analysis for length in {chars, bytes, tokens, words} and language. Saves the result as .jsonl.zst files which need a second pass to aggregate, but this first pass is the more expensive one anyway, and keeping per-document records means we can make nice histograms and the like. Should be run with `TOKENIZERS_PARALLELISM=false` for max performance, since that prevents thread thrashing. This script would be a useful template for other future analyses.

lang_len_analysis_pass2.py
: Pass 2 for length/language analysis. Aggregates and makes plots.

profanity_analysis_pass1.py
: Profanity analysis pass 1.

ablation_dedupe/make_excludes_lambada_wikitext.py
: For ablation; detokenizes LAMBADA and wikitext in preparation for eval-dedupe. This script should be obsolete now; `write_out.py` in lm_evaluation_harness handles many more sets. TODO: write a detailed guide on how to use `write_out.py`.
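The detokenization involved is roughly of the following form; the rules here are illustrative only, not the exact ones the script or `write_out.py` apply.

```python
import re

def rough_detokenize(text: str) -> str:
    # Undo common space-separated tokenization artifacts
    # (illustrative rules; the real scripts' rules may differ).
    text = re.sub(r" ([.,;:!?%)\]])", r"\1", text)   # no space before closing punctuation
    text = re.sub(r"([(\[$]) ", r"\1", text)         # no space after opening brackets
    text = text.replace(" n't", "n't").replace(" 's", "'s")
    return text
```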
ablation_dedupe/make_deduped.py
: For ablation; performs decontamination of training data against validation/test data. Run `make_excludes_lambada_wikitext` or `write_out.py` first. TODO: clean up and make an official validation-dedupe script.
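At its core, decontaminating against a precomputed exclude list looks like the sketch below, assuming the excluded eval texts fit in memory; the real script's matching rules and I/O differ.

```python
def decontaminate(train_docs, exclude_texts):
    # Drop any training document that contains an excluded eval text verbatim.
    # Naive O(len(train_docs) * len(exclude_texts)) scan; illustrative only.
    return [doc for doc in train_docs
            if not any(excl in doc for excl in exclude_texts)]
```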
repack_arxiv.py
: Packages the arXiv tar.gz into an lmd archive.

pile_proportions_sanitycheck.py
: Shows the component proportions of a sample of a Pile output, to make sure the proportions are about right.

github_reduce.py
: One-off script for cutting GitHub down to a manageable size. The Pile repo used to pull all 600GB of GitHub each time, which is excessive since we only use 95GB of it.

join.py
: Script for joining multiple lmd archives. Much faster than actually using lmd because we're not actually parsing the JSON.

fix_empty_lines.py
: One-off script for fixing extra newlines in lmd archives. Shouldn't be needed for replication, but included for completeness.
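As an aside, the reason `join.py` can skip JSON parsing is that jsonl.zst archives concatenate at the byte level: a sequence of zstd frames is itself a valid zstd stream. A minimal sketch (function name hypothetical):

```python
import shutil

def join_archives(paths, out_path):
    """Concatenate lmd (jsonl.zst) archives byte-for-byte.

    No decompression or JSON parsing needed, because concatenated
    zstd frames decompress as one continuous stream.
    """
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```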