# Deduplicating Training Data Makes Language Models Better

This repository contains code to deduplicate language model datasets as described in the paper ["Deduplicating Training Data Makes Language Models Better"](https://arxiv.org/abs/2107.06499) by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini.
Language model datasets are often created by scraping raw text from the Internet.
Often, this means that the same sequences are repeated multiple times (e.g., we find a single 50-word sequence that is repeated in the C4 dataset 60,000 times).
Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar perplexity.

This repository contains the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform deduplication and inspect the results (written in Python).
In an upcoming update, we will add files to reproduce the NearDup-deduplicated versions of the C4, RealNews, LM1B, and Wiki-40B-en datasets.

This is not an officially supported Google product.

If you use this repository or our deduplicated datasets you can cite:

```
@article{lee2021deduplicating,
  title={Deduplicating Training Data Makes Language Models Better},
  author={Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas},
  journal={arXiv preprint arXiv:2107.06499},
  year={2021}
}
```

## Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code: it is a (very slightly cleaned up) version of exactly what we do in the paper. It assumes that you want to deduplicate something the size of C4 (~300GB) on a machine with 96 cores and >600GB of RAM. If you only want to use this for reasonably sized datasets, you should change the number of parallel threads from 96 to something smaller. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).


We build a suffix array (based on [Andrew Gallant's suffix array implementation](https://github.com/BurntSushi/suffix/)) in [src/table.rs](src/table.rs). It has some minor changes from the original version that mean we can't just import the library as a crate. First, we need 64-bit integers: the original implementation says that u32 works for "reasonably sized documents (~4GB)", but we're working with unreasonably sized documents, so we need u64. Second, we don't want UTF-8 strings: everything is a `[u8]` byte array, because we might be working over token sequences which aren't valid UTF-8.
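
For intuition, here is a tiny, deliberately naive Python sketch of the data structure: a suffix array over a byte string is just the list of suffix start offsets, sorted by the suffixes they point to, and for a corpus in the hundreds of gigabytes those offsets need 8 bytes each. This is purely illustrative; it does not describe the exact on-disk format that [src/table.rs](src/table.rs) writes.

```python
import struct

# In practice `data` is the raw bytes (or token bytes) of the whole dataset.
data = b"abracadabra"

# A suffix array is the list of suffix start offsets, sorted lexicographically
# by the suffix each offset points to.
suffix_array = sorted(range(len(data)), key=lambda i: data[i:])

for offset in suffix_array:
    print(offset, data[offset:])

# With datasets larger than 4GB, 32-bit offsets overflow, hence u64 offsets.
# (Little-endian packing assumed for illustration; the real layout may differ.)
packed = struct.pack("<%dQ" % len(suffix_array), *suffix_array)
print(len(packed), "bytes of u64 offsets")
```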

The main complication in the rest of [src/main.rs](src/main.rs) is the fact that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.

If you just want to run the Rust deduplicator, then you will only need to install Rust (for example, via [rustup](https://rustup.rs)).

If you additionally want to generate datasets to run the Rust script on (and you probably do), then you will also need to install the Python dependencies:

```pip3 install numpy scipy tensorflow tensorflow_datasets transformers sentencepiece```

### Basic Usage

If you just want to reproduce the results of this paper, or deduplicate any language modeling dataset that's already in the [TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets) format, then you can just run the following commands:

```cargo build```

to compile the Rust code, and then run

```python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]```

For example, to load the LM1B test set you could run `python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test`. This should take just a few seconds to run on the test set, or about an hour if running with the `train` split instead.
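
For a sense of what this loading step amounts to, here is a rough sketch that uses TFDS directly to dump the raw text of a split into a single file. It is only an approximation of what `scripts/load_dataset.py` does (the real script also handles separators between examples and optional tokenization), and the output filename here is made up for illustration.

```python
import os
import tensorflow_datasets as tfds

# Stream a TFDS split and dump its raw text into one big file that the Rust
# code can build a suffix array over. Sketch only; see scripts/load_dataset.py
# for the real logic.
ds = tfds.load("lm1b", split="test",
               data_dir=os.path.expanduser("~/tensorflow_datasets"))

with open("data/lm1b.test.sketch", "wb") as out:
    for example in tfds.as_numpy(ds):
        out.write(example["text"])  # raw bytes of one example
        out.write(b"\n")
```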

If the dataset is really big, you might want to add the `--tokenize` flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
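
To see where the factor of two comes from, you can compare the UTF-8 size of some text against the space its GPT-2 token ids would take. The two-bytes-per-token figure below is an assumption made for this back-of-the-envelope check, not a statement about the script's exact output format.

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Deduplicating training data makes language models better. " * 50
ids = tokenizer.encode(text)

raw_bytes = len(text.encode("utf-8"))
token_bytes = 2 * len(ids)  # assume each token id fits in 2 bytes (GPT-2 vocab < 2^16)

print(f"raw: {raw_bytes} bytes, tokenized: {token_bytes} bytes, "
      f"ratio: {raw_bytes / token_bytes:.2f}x")
```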

And then to construct the suffix array run

```python3 scripts/make_suffix_array.py [path/to/dataset]```

For example, if you run `python3 scripts/make_suffix_array.py data/lm1b.test`, this will create a file `data/lm1b.test.table.bin` containing the suffix array. Again, this should be fast on the test set, and takes about two hours on the LM1B train set when run on a single thread (or a few minutes on 96 cores).

(If you get an error that you have too many open files, that's because this script opens lots of files. You should run `ulimit -Sn 1000000` to "fix" the error.)

#### Querying a suffix array to find duplicated examples

Start by loading and building a suffix array for a dataset as described above.

Once you have the suffix array, you can query the dataset to find all occurrences of a particular string. To do this, run

```python3 scripts/count_occurances.py --suffix [path/to/suffix_array] [--query query_string] [--query_file /path/to/query]```

On the LM1B test set, running `python3 scripts/count_occurances.py --suffix data/lm1b.test --query " on Tuesday"` should return 1288. If you tokenized the dataset, then you should pass `--tokenize` to `count_occurances.py` as well, to get the same result (plus or minus tokenization differences).

If you want to confirm this number is correct (assuming you haven't tokenized the dataset), you can run `cat data/lm1b.test | grep -ao " on Tuesday" | wc -l` and get the same result.
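
The same sanity check can be done from Python. `bytes.count` counts non-overlapping matches, which is equivalent for a query like this; the path below assumes the dataset was saved to `data/lm1b.test` as in the commands above.

```python
# Count literal occurrences of the query directly in the raw dataset bytes.
with open("data/lm1b.test", "rb") as f:
    data = f.read()

print(data.count(b" on Tuesday"))  # should match the count reported above
```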

### Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:

#### Single threaded suffix array construction

To build a suffix array for any particular file, you can run

```cargo run save [file_path]```

This will create a file called `[file_path].table.bin` which contains the suffix array for the file provided. This algorithm is linear time, but (a) only runs on a single core, and (b) has memory requirement `O(big * len(file))` which is prohibitive for large files.

#### Parallel suffix array construction

To build a suffix array for an extremely large file (e.g., one roughly as large as the available RAM), it is better to run the `scripts/make_suffix_array.py` script described above.

This script will build the suffix array in parallel by splitting the single file into chunks, building a suffix array for each chunk, and then merging the pieces together into one suffix array.

The two steps are described below.

##### Building a piece of a suffix array from a piece of a file

The first step generates a suffix array from a piece of a file. This is implemented by running

```cargo run save_part [file_path] [byte_start] [byte_end]```

This builds a suffix array for the byte sequence between [byte_start] and [byte_end] of the given file. Many of these commands can be run in parallel to quickly build a suffix array for the whole file, as sketched below.
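
As a concrete illustration, the snippet below shells out to `cargo run save_part` over equal-sized byte ranges using a small worker pool. This is only a sketch: the real `scripts/make_suffix_array.py` picks chunk sizes and boundaries more carefully (and performs the merge afterwards), so the splitting policy here should not be taken as the one used in the paper.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_parts(file_path: str, num_parts: int = 8, jobs: int = 8) -> None:
    """Run `cargo run save_part` over equal byte ranges of file_path in parallel."""
    size = os.path.getsize(file_path)
    step = size // num_parts

    def run_one(i: int) -> None:
        start = i * step
        end = size if i == num_parts - 1 else (i + 1) * step
        subprocess.run(
            ["cargo", "run", "save_part", file_path, str(start), str(end)],
            check=True,
        )

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Materialize the iterator so exceptions from failed jobs propagate.
        list(pool.map(run_one, range(num_parts)))

build_parts("data/lm1b.test", num_parts=8)
```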

##### Merging suffix array pieces to create a single suffix array

Given the several independent suffix array pieces, merging them is now just a matter of calling the deduplicator's merge operation to generate a collection of ordered suffix array pieces in the output directory. The pieces can then be concatenated into the final suffix array with

```cat [tmp_output_directory]/* > [file_path].table.bin```

#### Finding Duplicates

Given a suffix array file, as generated in the previous section, it can now be queried for interesting statistics.
The simplest operation, counting occurrences of a particular substring, takes O(log(N)) time and O(query_length) memory (as shown above with `scripts/count_occurances.py`). To do this you can run:

```cargo run count_occurances /path/to/dataset /path/to/query_file```

(Indeed, the Python script is just a wrapper that makes calling this nicer, with the option for tokenization.)
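
To make the O(log(N)) claim concrete, here is a minimal Python sketch of the same idea: because the suffixes are sorted, every suffix starting with the query sits in one contiguous block of the suffix array, and two binary searches find that block's boundaries. This is for intuition only and is not the Rust implementation.

```python
def count_occurrences(data: bytes, sa: list[int], query: bytes) -> int:
    """Count occurrences of query in data using its suffix array sa."""
    q = len(query)

    def prefix(k: int) -> bytes:
        # First q bytes of the k-th smallest suffix.
        return data[sa[k]:sa[k] + q]

    # Lower bound: first suffix whose q-byte prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose q-byte prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return lo - first

data = b"abracadabra"
sa = sorted(range(len(data)), key=lambda i: data[i:])
print(count_occurrences(data, sa, b"abra"))  # 2
```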

This is useful mainly as a command-line interface to interact with the dataset and find interesting properties. To run more sophisticated analysis, use the tools described below.

##### Finding duplicates between two documents

Given a document A and another document B, we can find all duplicates between the two by (1) constructing suffix arrays for both, and then (2) linearly walking the suffix arrays in order to find all duplicates of a given length.

The second step is then to run

```cargo run collect_similar [dataset2]```

This converts the result into ranges, so that instead we have matches of the form dataset2[xi:yi].
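
The Rust code walks the two suffix arrays together in linear time. The sketch below illustrates the same goal in a lazier way: it binary-searches each `min_len`-byte window of document A into the suffix array of document B, which is enough to see where shared sequences of at least `min_len` bytes live. It is an illustration of the idea only, not the tool's algorithm or output format; all names here are made up.

```python
def cross_matches(doc_a: bytes, doc_b: bytes, sa_b: list[int], min_len: int = 50):
    """Yield (pos_in_a, pos_in_b) pairs where the documents share >= min_len bytes."""
    for i in range(len(doc_a) - min_len + 1):
        window = doc_a[i:i + min_len]
        # Binary search for the first suffix of doc_b whose prefix is >= window.
        lo, hi = 0, len(sa_b)
        while lo < hi:
            mid = (lo + hi) // 2
            if doc_b[sa_b[mid]:sa_b[mid] + min_len] < window:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(sa_b) and doc_b[sa_b[lo]:sa_b[lo] + min_len] == window:
            yield i, sa_b[lo]

doc_a = b"the quick brown fox jumps over the lazy dog"
doc_b = b"a lazy dog sat watching the quick brown hare"
sa_b = sorted(range(len(doc_b)), key=lambda j: doc_b[j:])
print(list(cross_matches(doc_a, doc_b, sa_b, min_len=9)))
```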

##### Finding duplicates within one document

To find duplicates that are contained within one document (for example, to actually deduplicate a dataset as we do in the paper) run the command

```cargo run selfsimilar_parallel [dataset]```

This will find all repeated substrings contained in the dataset above a given length threshold. Again, run `collect_similar` to find the indices of the repeated examples.
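
The idea behind this is again easy to sketch: in a sorted suffix array, all suffixes sharing a given prefix form a contiguous block, so every offset that starts a repeated substring of at least the threshold length is adjacent to another suffix with the same prefix. The minimal sketch below flags those offsets; the actual deduplication logic (and its output format) lives in the Rust code.

```python
def repeated_offsets(data: bytes, sa: list[int], min_len: int) -> set[int]:
    """Offsets at which a substring of length >= min_len occurs elsewhere in data."""
    hits = set()
    for x, y in zip(sa, sa[1:]):
        if len(data) - max(x, y) < min_len:
            continue  # one of the two suffixes is shorter than min_len
        if data[x:x + min_len] == data[y:y + min_len]:
            hits.add(x)
            hits.add(y)
    return hits

data = b"to be or not to be, that is the question"
sa = sorted(range(len(data)), key=lambda i: data[i:])
print(sorted(repeated_offsets(data, sa, min_len=5)))  # the two copies of "to be"
```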

## Approx Deduplication Results

Coming soon.
