Simplify process

carlini committed Mar 7, 2022
1 parent fc1bd78 commit 91be30f
Showing 3 changed files with 8 additions and 4 deletions.
1 change: 0 additions & 1 deletion Cargo.toml
```diff
@@ -11,6 +11,5 @@ overflow-checks = false # Go FAAASSTTT!
 [dependencies]
 zstd = "0.5"
 crossbeam = "0.3"
-fasthash = "0.4"
 filebuffer = "0.4"
 clap = { version = "3.1.1", features = ["derive"] }
```
10 changes: 8 additions & 2 deletions README.md
````diff
@@ -43,7 +43,9 @@ To run the rust deduplicator you will need to install Rust:
 
 ```curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh```
 
-If you additionally want to generate datasets to run the rust script on (and you probably do) then you will need python dependencies:
+You'll also need a C compiler; `sudo apt-get install gcc` will do that if you don't already have one.
+
+If you additionally want to generate datasets to run the rust script on (and you probably do, at least to follow this demo) then you will need python dependencies:
 
 ```pip3 install numpy scipy tensorflow tensorflow_datasets transformers sentencepiece```
 
@@ -60,7 +62,11 @@ to compile the rust code, and then run
 
 ```python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]```
 
-For example, to get the LM1B training set you could run `python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test`. This should take just a minute or so to run on the test set or about an hour if running with the `train` set instead.
+For example, to get the LM1B test set (you should do this, to walk through the demo) run
+
+```python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test```
+
+This should take just a minute or so to run on the test set or about an hour if running with the `train` set instead.
 
 If the dataset is really big, you might want to add the `--tokenize` flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
````
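To make the `--tokenize` savings concrete, here is a minimal sketch of the idea, assuming the script uses the HuggingFace GPT-2 tokenizer from the `transformers` package installed above; the actual `scripts/load_dataset.py` may differ in its details. GPT-2's BPE vocabulary (50,257 entries) fits in 16 bits, and an average token covers roughly four bytes of English text, which is where the roughly 2x size reduction comes from.

```python
# Hedged sketch of the --tokenize idea, not the repository's exact code:
# encode text with the GPT-2 tokenizer and store each token id as 2 bytes.
import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def tokenize_to_bytes(text: str) -> bytes:
    # GPT-2 ids are < 50,257, so each fits in a uint16 (2 bytes per token).
    ids = tokenizer.encode(text)
    return np.array(ids, dtype=np.uint16).tobytes()

# ~4 bytes of English text per token -> 2 bytes stored: roughly a 2x shrink.
print(len(tokenize_to_bytes("deduplicating training data makes language models better")))
```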
1 change: 0 additions & 1 deletion src/main.rs
```diff
@@ -60,7 +60,6 @@ extern crate filebuffer;
 extern crate zstd;
 extern crate crossbeam;
 extern crate clap;
-extern crate fasthash;
 
 use std::cmp::Ordering;
 use std::collections::BinaryHeap;
```
