
# Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper.
This is very much research code: it works well for what we designed it to do, but probably not much more.
We did clean it up fairly significantly for the Version 1.0.0 release (see below for the release history).
If you want to deduplicate small (<10GB) datasets, it should work on any modern machine with 16GB of RAM and a few CPUs.
If you want to deduplicate something the size of C4 (~300GB) you will want a machine with as many cores as you can get (we used 96 cores) and >600GB of RAM. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).


We build a suffix array (based on [Andrew Gallant's suffix array implementation](https://github.com/BurntSushi/suffix/)) in [src/table.rs](src/table.rs). It has some minor changes from the original version that make it so we can't just import this library as a crate. First, we need 64-bit integers. The original implementation says that u32 works for "reasonably sized documents (~4GB)" but we're working with unreasonably sized documents. So we need u64. Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8.
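
To make the data layout concrete, here is a minimal Python sketch (not the repo's Rust code) of what a suffix array over a raw byte buffer looks like: every starting offset, sorted by the suffix it points at, with offsets wide enough to address a huge file. The exact on-disk layout of the real `.table.bin` file (header, endianness) may differ, so treat this purely as an illustration.

```
import struct

data = b"abracadabra"  # stands in for the concatenated dataset bytes

# A suffix array is just every starting offset, sorted by the suffix beginning there.
# (Sorting whole suffixes like this is fine for a toy input, not for 300GB.)
table = sorted(range(len(data)), key=lambda i: data[i:])

# The Rust code stores these offsets as 64-bit integers so datasets larger than
# 4GB (the u32 limit) can be indexed; little-endian packing here is an assumption.
table_bytes = struct.pack("<%dQ" % len(table), *table)

for offset in table:
    print(offset, data[offset:])
```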

## Version History

Version 0.1.0 was an initial code release that reproduces the paper.
- The code worked, but was rather terrible.
- I am sorry if you had to look at it.
- You don't want to look at this code unless you're explicitly trying to reproduce our paper.

Version 1.0.0 is a complete restructuring of the code. IT IS NOT BACKWARDS COMPATIBLE.
- The suffix array data structure is basically the only thing that remains unchanged (thanks to Andrew Gallant who actually understood how to write code). You won't need to re-generate the suffix array tables.

## Installing

To run the rust deduplicator you will need to install Rust:

```curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh```

If you additionally want to generate datasets to run the rust script on (and you probably do, at least to follow along with this guide), you will also need Python 3 along with the packages the scripts in `scripts/` import.

## Basic Usage

This section walks through the basics of getting started with the code.
Later we'll cover how to actually deduplicate a dataset; for now we'll just walk through how the pieces work.

Start by running

```cargo build```

For example, to get the LM1B data used throughout this guide, you could run `python3 scripts/load_dataset.py` with flags selecting the dataset (`lm1b`), the split, and where to save the output.

If the dataset is really big, you might want to add the `--tokenize` flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
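
Roughly, the saving comes from storing each token id in two bytes instead of the several characters of text it covers. We haven't checked the exact byte layout `load_dataset.py --tokenize` uses, or which tokenizer library it calls, so this is only an illustrative sketch of why the factor is about two:

```
from transformers import GPT2TokenizerFast
import numpy as np

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "The proposal for temporary curbs from the Financial Stability Board will be submitted to leaders."
ids = tok.encode(text)

# GPT-2's ~50k-entry vocabulary fits in 16 bits, so each token can be stored
# in two bytes; English text averages roughly four characters per token,
# which is where the roughly 2x size reduction comes from.
packed = np.asarray(ids, dtype=np.uint16).tobytes()
print(len(text.encode("utf-8")), "bytes of text ->", len(packed), "bytes of tokens")
```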

Running `load_dataset.py` will create two files, `data/lm1b.test` and `data/lm1b.test.size`.
The first contains the entire LM1b test set smashed together into one long byte sequence, and the second contains the byte offset at which each individual example begins, in sorted order.
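
If you want to sanity-check these two files, here is a small sketch that splits the concatenated file back into individual examples. It assumes the offsets in `data/lm1b.test.size` are stored as little-endian unsigned 64-bit integers (matching the other binary files in this walkthrough) and that each entry is a start offset; if the script uses a different layout, adjust accordingly.

```
import numpy as np

data = open("data/lm1b.test", "rb").read()
# Assumed layout: one little-endian u64 byte offset per example, in sorted order.
offsets = np.frombuffer(open("data/lm1b.test.size", "rb").read(), dtype=np.uint64)

# Reconstruct the first few examples by slicing between consecutive offsets.
bounds = list(offsets) + [len(data)]
for start, end in zip(bounds[:3], bounds[1:4]):
    print(repr(data[int(start):int(end)][:80]))
```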

From here we can build a suffix array of the entire dataset, which now lives in a single file:

```python3 scripts/make_suffix_array.py [path/to/dataset]```

For example, if you run `python3 scripts/make_suffix_array.py data/lm1b.test`, this will create a file `data/lm1b.test.table.bin` containing the suffix array. Again, this should be fast: the test set processes in just a few seconds, while the LM1B train set takes about two hours single-threaded and a few minutes on 96 cores.

(If you get an error that you have too many open files, that's because this script opens lots of files. You should run `ulimit -Sn 1000000` to "fix" the error. You might want to do this preemptively before hitting this crash after hour ten of the job.)

### Querying a suffix array to find duplicated examples

Start by loading a dataset and building a suffix array for it, as described above.

We're not yet going to deduplicate a dataset.
To start, let's just see how to count how often a particular example has been repeated.
To do this, run

```python3 scripts/count_occurrences.py --suffix [path/to/dataset] [--query query_string] [--query_file /path/to/query]```

This should be very fast. Even on a dataset that's hundreds of gigabytes, it should take just a few seconds, most of which is dominated by Python starting up. The actual core lookup only requires O(log(dataset_size)) time, which is typically on the order of milliseconds.

On the LM1B test set, running `python3 scripts/count_occurrences.py --suffix data/lm1b.test --query " on Tuesday"` should return 1288. If you tokenized the dataset, then you should pass `--tokenize` to `count_occurrences.py` as well, to get the same result (plus or minus tokenization differences).

If you want to confirm that this number is correct (assuming you haven't tokenized), you can run `cat data/lm1b.test | grep -ao " on Tuesday" | wc -l` and get the same result.
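
The O(log n) claim comes from the fact that every suffix beginning with the query occupies one contiguous block of the sorted suffix array, so two binary searches find the block's endpoints. Here is a toy Python sketch of that idea; it builds a tiny in-memory suffix array rather than reading the on-disk `.table.bin` file, whose exact layout we don't rely on here.

```
data = b"on Monday and on Tuesday and on Tuesday again"  # toy stand-in for the dataset bytes
query = b" on Tuesday"

# Toy suffix array; the real script uses the precomputed data/lm1b.test.table.bin.
table = sorted(range(len(data)), key=lambda i: data[i:])

def bound(q, strict):
    # Binary search for the first suffix whose prefix is >= q (strict=False)
    # or > q (strict=True); O(log n) comparisons either way.
    lo, hi = 0, len(table)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = data[table[mid]:table[mid] + len(q)]
        if prefix < q or (strict and prefix == q):
            lo = mid + 1
        else:
            hi = mid
    return lo

# Every suffix in [lower, upper) starts with the query, so the difference is the count.
print(bound(query, True) - bound(query, False))  # prints 2 for this toy string
```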

## Deduplicating a Dataset

Now let's explain how to deduplicate a dataset as we do in the paper. As a running example we'll continue with the LM1b test set.


### Finding all repeated substrings

The first step in deduplicating a dataset is identifying all substrings of a given length that are repeated more than some threshold number of times. To do this we run the `self-similar` command:

```
cargo run self-similar --data-file /tmp/data/lm1b.test --length-threshold 100 --cache-dir /tmp/cache --num-threads 8
```

For larger datasets, you may want to set `--num-threads` to as many cores as you have on your machine. It parallelizes perfectly, so there's no reason not to. For now, though, keep it at 8 just for the sake of keeping things on track with this guide.

This will probably end by saying something like

```
Duplicates found: 28464
```

This means that the deduplicator found 28,464 positions where a sequence of length 100 also exists somewhere else in the dataset. The right length threshold is entirely dataset-dependent. In our paper we used 50 tokens, which is 100 bytes once tokenized (each token is stored as two bytes), so remember that if you pass `--tokenize` the threshold is still specified in bytes and you'll need to double the token count.

At this point the deduplicator will have dumped a bunch of files into the cache directory. There are two kinds of files here:
- /cache/dups_$DATASET_A-B
- /cache/sizes_$DATASET_A-B

Each `dups` file is a list of u64 pointers into the dataset, each pointing at a sequence that is repeated multiple times. Each file covers the duplicates corresponding to items A through B in the suffix array. There should be 28,464 total entries when summed across all of these files. The duplicates are clustered together, so all duplicates of the same string appear sequentially.

Each `sizes` file says how large each of those clusters is, again as u64 values. The cluster sizes are typically small numbers.

The above explanation might be confusing, so let's look at an example and find the first duplicate in the dataset:
```
$ xxd /tmp/cache/sizes_lm1b.test_0-5444411 | head -n 1
00000000: 0200 0000 0000 0000 0200 0000 0000 0000 ................
$ xxd /tmp/cache/dups_lm1b.test_0-5444411 | head -n 1
00000000: a429 7000 0000 0000 a9a8 5f00 0000 0000 .)p......._.....
```

This is telling us that the first cluster of duplicates has size 2: the first occurrence starts at location 0x7029a4 in the data file and the second at location 0x5fa8a9. To confirm this, you can run
```
$ python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
>>> open("/tmp/data/lm1b.test","rb").read()[0x7029a4:0x7029a4+100]
b'\x00\x00The proposal for temporary curbs from the Financial Stability Board will be submitted to leaders o'
>>> open("/tmp/data/lm1b.test","rb").read()[0x5fa8a9:0x5fa8a9+100]
b'\x00\x00The proposal for temporary curbs from the Financial Stability Board will be submitted to leaders o'
```

And we've confirmed that this example does indeed appear twice in the dataset.
(Exercise for the reader: how would you count how many times this string is repeated in the dataset? It should be twice. Can you check that?)
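
If you'd rather not read hex dumps by eye, the same check can be done programmatically. This sketch assumes the layout described above: little-endian u64 entries, with each `sizes` entry giving the number of pointers in the corresponding cluster of the `dups` file (as the hex dump suggests). The exact file names depend on your dataset and suffix-array shard boundaries.

```
import numpy as np

data = open("/tmp/data/lm1b.test", "rb").read()
sizes = np.frombuffer(open("/tmp/cache/sizes_lm1b.test_0-5444411", "rb").read(), dtype=np.uint64)
dups = np.frombuffer(open("/tmp/cache/dups_lm1b.test_0-5444411", "rb").read(), dtype=np.uint64)

# Walk the first few clusters: each cluster of size k owns the next k pointers
# in the dups file, and every pointer is a byte offset of the same repeated string.
cursor = 0
for k in sizes[:3]:
    cluster = dups[cursor:cursor + int(k)]
    cursor += int(k)
    print([hex(int(p)) for p in cluster])
    print(data[int(cluster[0]):int(cluster[0]) + 100])
```

To answer the exercise above, you could also write the 100-byte string to a file and pass it to `scripts/count_occurrences.py` with `--query_file`.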


### Collecting the duplicates together

The next step is to take all of the length-100 sequences we've found and collect them together to figure out what we should actually remove from the dataset.
To see why this is necessary, imagine that we have a length-200 sequence that's repeated more than once.
The data we have so far would tag this sequence as a duplicate at roughly a hundred positions: once for each starting byte from which a length-100 match still extends.

This step reduces that down to just the ranges of bytes [a,b) that are duplicated more than once.
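
Conceptually, this is an interval merge over the sorted duplicate pointers: every pointer marks bytes [p, p+100) as duplicated, and overlapping intervals collapse into a single range. The following Python sketch only illustrates the idea; it is not the repo's `collect` implementation, which works over the on-disk cache files in Rust.

```
def collect_ranges(pointers, length_threshold=100):
    # Merge per-position duplicate matches into maximal [a, b) byte ranges.
    ranges = []
    for p in sorted(pointers):
        start, end = p, p + length_threshold
        if ranges and start <= ranges[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [(a, b) for a, b in ranges]

# A length-200 repeated region tagged at 101 consecutive starting bytes
# collapses back into the single range [1000, 1200).
print(collect_ranges(range(1000, 1101)))
```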
To do this, run
```
cargo run collect --data-name lm1b.test --cache-dir /tmp/cache --length-threshold 100 > /tmp/lm1b.test.remove.byterange
```

The output here will be a long list of byte ranges, printed as pairs of start and end offsets:
```
...
out
185290 185564
424048 424148
482724 482824
534604 534716
...
```

What this means is that the substring in the dataset from byte 185290 to byte 185564 is repeated more than once and should be removed.
Let's check this.
```
$ python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
>>> data=open("/tmp/data/lm1b.test","rb").read()
>>> data[185290:185564]
b' to use their wireless phones concurrently to make calls ; send and receive email and text , picture and video messages ; access the Internet; view high-quality videos ; and download music , games and ringtones , while enjoying clearer reception and fewer dropped calls .\xff\xff'
>>> data.count(data[185290:185564])
2
```

Looks great! Now that we have this file, we can go back and actually deduplicate the dataset.
In our paper we suggest taking all of the duplicate sequences that have been identified and completely striking them from the dataset.
This somewhat breaks the flow of text: for example, if we previously had the example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore".
In practice we have found this doesn't hurt the language model, because we remove relatively little text and so these breaks don't cause harm.
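
Mechanically, striking the ranges just means keeping every byte that doesn't fall inside some [a, b) pair from the `.remove.byterange` file. Here is a sketch of that step, assuming the ranges are sorted and non-overlapping as in the output above; the real pipeline, described next, also has to re-serialize the result into a valid dataset format.

```
def strike_ranges(data, ranges):
    # Remove every [a, b) byte range from data, keeping everything else.
    kept, cursor = [], 0
    for a, b in ranges:
        kept.append(data[cursor:a])
        cursor = b
    kept.append(data[cursor:])
    return b"".join(kept)

text = b"Alice wanted to go to the store"
print(strike_ranges(text, [(12, 26)]))  # b'Alice wantedstore'
```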

Exactly how you write out a deduplicated dataset depends on the format the dataset started in.
If you're just running this on LM1b, we've provided a script that does this conversion for you and outputs another valid TensorFlow Dataset directory. But if you're using some other dataset, this is the part you'll have to write yourself.

To run the LM1b script, you can just run this command

```
python3 scripts/finish_dedup_lm1b.py --data_dir ~/tensorflow_datasets/ --save_dir /tmp/dedup --name lm1b --split test --suffixarray_dir /tmp/data --remove /tmp/lm1b.test.remove.byterange
```

You can verify the deduplication has succeeded by re-running the pipeline on the resulting output. Instead of finding 28,464 duplicate sequences during the deduplication phase, it should instead find 92. Importantly, you can check that these 92 duplicates are not errors of the pipeline: they are new sequences that are now duplicated when previously they were not. You can check this by running `count_occurrences.py` on the original dataset for the sequences that now have two occurrences.

Why do we get new duplicates? When a duplicated span is struck out of an example, the text on either side of the removed span is joined together, and that newly joined text can itself match other newly joined text elsewhere in the dataset, creating duplicates that did not exist before.
To generate the results in our paper, we ran the deduplicator twice. This often cuts the number of remaining duplicates down by over 100,000x, which in practice means ~zero for normal datasets, or a few hundred for massive 100GB+ datasets.



## A full end-to-end deduplication example

Okay, so maybe you don't like reading and you skipped the entire section above. (Honestly, I don't blame you.) You just want it to run.
Then just do this:

```
bash scripts/run_pipeline.sh
python3 scripts/finish_dedup_lm1b.py --data_dir ~/tensorflow_datasets/ --save_dir /tmp/dedup --name lm1b --split test --suffixarray_dir /tmp/data --remove /tmp/lm1b.test.remove.byterange
```

This will run the entire deduplication pipeline top-to-bottom, starting with loading the LM1b test set, then creating a suffix array, finding all repeated sequences, merging them together to sequence ranges, and finally spitting out a deduplicated TF Dataset that you can use exactly as normal.


## Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:
To build a suffix array for an extremely large file (e.g., one about as large as the RAM you have available, or larger), you should instead build the array in pieces and merge the pieces together, as described below.

This script will build the suffix array in parallel by splitting the single file into chunks, generating suffix arrays for each chunk, and then merging the suffix arrays together to form the full suffix array. Note that in general this algorithm is quadratic, but when the maximum substring length is short relative to the total file length (as it is, when generating suffix arrays for N independent training examples) it will never reach this worst case behavior.
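
As a toy illustration of the chunk-and-merge idea (not the repo's actual implementation, which works on disk and in parallel): build a suffix array for each chunk independently, then merge the per-chunk arrays by comparing the suffixes their offsets point at in the full data.

```
import heapq

def chunked_suffix_array(data, chunk_size):
    # Build a suffix array for each chunk independently; each chunk's array
    # holds offsets into the *global* data, sorted by the suffix starting there.
    chunk_tables = []
    for start in range(0, len(data), chunk_size):
        offsets = range(start, min(start + chunk_size, len(data)))
        chunk_tables.append(sorted(offsets, key=lambda i: data[i:]))
    # Merge the sorted chunk tables by comparing suffixes. Comparing full
    # suffixes is what makes the worst case quadratic, but with many short
    # independent examples the shared prefixes stay short in practice.
    return list(heapq.merge(*chunk_tables, key=lambda i: data[i:]))

data = b"banana banana band"
assert chunked_suffix_array(data, 6) == sorted(range(len(data)), key=lambda i: data[i:])
print(chunked_suffix_array(data, 6))
```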


The two steps are described below.

#### Building a piece of a suffix array from a piece of a file
