hishab-nlp/text-dedup-fork



Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use, or modify based on your needs:

  • MinHash + MinHashLSH, including a spark implementation suitable for large (TB) datasets
  • 64 or 128 bit SimHash
  • SuffixArray Substring
  • Bloom Filter
  • Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future:

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you understand what is at stake when using them. You can use them to bootstrap your own script, or just as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Quick Examples

Native PySpark

MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!

Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process the files with your local compute by:

# input
export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

# output
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
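
The `Using B=25, R=10` line in the log shows the 250 permutations split into 25 bands of 10 rows each. One common way to choose such a split for a given threshold is to minimise a weighted sum of the false-positive and false-negative areas under the LSH collision curve `1 - (1 - s^r)^b`. The sketch below illustrates that idea only; the equal weights, the midpoint-rule integration, and the names `choose_bands` and `lsh_collision_probability` are assumptions for illustration and may not match what `minhash_spark.py` actually does.

```python
def lsh_collision_probability(s: float, b: int, r: int) -> float:
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def integrate(f, lo: float, hi: float, steps: int = 200) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def choose_bands(threshold: float, num_perm: int,
                 fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    """Search all (bands, rows) with bands * rows <= num_perm for the lowest weighted error.

    Hypothetical helper: the weights and the exhaustive search are assumptions.
    """
    best, best_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            # False positives: pairs below the threshold that still collide.
            fp = integrate(lambda s: lsh_collision_probability(s, b, r), 0.0, threshold)
            # False negatives: pairs above the threshold that never collide.
            fn = integrate(lambda s: 1.0 - lsh_collision_probability(s, b, r), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (b, r), error
    return best


if __name__ == "__main__":
    bands, rows = choose_bands(threshold=0.7, num_perm=250)
    print(f"bands={bands}, rows={rows}")
```

Raising the false-negative weight favours more bands with fewer rows (higher recall); raising the false-positive weight does the opposite.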

Alternatively, see bigcode-v2/run.sh for how to run the job on GCP Dataproc.

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
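
To make the idea behind substring deduplication concrete, here is a toy sketch: sort all suffixes of the text, then compare neighbouring suffixes to find substrings that occur more than once. It is a naive illustration only (quadratic memory, single string, hypothetical function names); the command above delegates the heavy lifting to the external tool pointed to by `--google_repo_path`, which works very differently at scale.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def duplicated_spans(text: str, min_length: int) -> list[tuple[int, int]]:
    """Return (start, end) spans of substrings of at least min_length characters
    that also occur elsewhere in the text."""
    suffix_array = build_suffix_array(text)
    spans = set()
    # Suffixes sharing a long prefix end up next to each other after sorting.
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        lcp = common_prefix_length(text[prev:], text[cur:])
        if lcp >= min_length:
            spans.add((prev, prev + lcp))
            spans.add((cur, cur + lcp))
    return sorted(spans)


if __name__ == "__main__":
    sample = "hello world. hello world!"
    for start, end in duplicated_spans(sample, min_length=8):
        print(start, end, repr(sample[start:end]))
```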

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
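
For readers new to MinHash, the following sketch shows the core idea: hash every word n-gram under many seeded hash functions, keep the minimum per seed, and estimate Jaccard similarity as the fraction of matching minima. The tokenisation, the use of `hashlib.blake2b`, and the function names are assumptions chosen for a self-contained example; the script's own hashing and clustering are more involved.

```python
import hashlib


def shingles(text: str, ngram_size: int = 5) -> set[str]:
    """Lower-cased word n-grams; short texts fall back to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < ngram_size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + ngram_size]) for i in range(len(tokens) - ngram_size + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum 64-bit hash value per seed (the seed acts as the 'permutation')."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # assumption: salted BLAKE2b as the hash family
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog near the river"))
    b = minhash_signature(shingles("the quick brown fox jumps over the lazy dog near the lake"))
    print(f"estimated Jaccard: {estimate_jaccard(a, b):.2f}")
```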

For a local data folder (CSV, JSONL, text, etc.)

# input
python -m text_dedup.minhash \
  --path "json" \
  --data_dir "mypath/mydata" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
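
As a rough illustration of SimHash: each token votes on every bit of the fingerprint, and near-duplicates end up with fingerprints that differ in only a few bits. The sketch below assumes whitespace tokens, unweighted votes, and `hashlib.blake2b` as the token hash; none of these choices is taken from the script itself.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Unweighted SimHash fingerprint; bits must be a multiple of 8 (e.g. 64 or 128)."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on bit i depending on its own hash.
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set when the positive votes win.
    return sum(1 << i for i, count in enumerate(counts) if count > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    x = simhash("the quick brown fox jumps over the lazy dog")
    y = simhash("the quick brown fox jumped over the lazy dog")
    print(hamming_distance(x, y))
```

Two documents are then treated as near-duplicates when the Hamming distance falls below a chosen bit threshold.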

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
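
Document-level exact deduplication boils down to keeping the first document for each distinct content hash. A minimal sketch, assuming SHA-256 as the hash and a hypothetical `exact_dedup` helper (the script may use a different hash function and batching):

```python
import hashlib


def exact_dedup(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document, preserving order."""
    seen: set[str] = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


if __name__ == "__main__":
    print(exact_dedup(["a b c", "d e f", "a b c"]))
```

Storing digests instead of full texts keeps memory bounded by the number of distinct documents rather than their total size.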

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
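
A Bloom filter trades a small false-positive rate (the `--error_rate` above) for constant memory: a document is dropped when all of its bit positions are already set. The sketch below uses the standard sizing formulas and double hashing over SHA-256; the class and its `capacity` parameter are assumptions for illustration, not the script's implementation.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized for `capacity` items at roughly `error_rate` false positives."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return seen


if __name__ == "__main__":
    bloom = BloomFilter(capacity=100_000, error_rate=1e-5)
    docs = ["a b c", "d e f", "a b c"]
    print([doc for doc in docs if not bloom.add(doc)])
```

Unlike the exact-hash approach, a Bloom filter can occasionally discard a document that was never seen, which is consistent with the slightly lower "After" count in the log above.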

Benchmarks

pinecone/core-2020-05-10-deduplication

See tests/test_benchmark_core.py for reproduction.

| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash Spark | 0.957 | 0.9445 | 0.9471 | 0.959 | 0.952 | 0.9202 | 698.76s |
| MinHash | 0.9594 | 0.9445 | 0.9474 | 0.9616 | 0.9534 | 0.924 | 18.80s |
| SimHash** | 0.9007 | 0.6786 | 0.7681 | 0.9343 | 0.8344 | 0.8137 | 253.94s |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
| Exact Title Matching [1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
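
For reference, the columns in this table can be derived from per-document boolean labels as sketched below. This assumes one "is duplicate" label per document and macro F1 computed as the mean of the two per-class F1 scores; see tests/test_benchmark_core.py for how the benchmark actually defines its labels. The function names are hypothetical.

```python
def precision_recall(y_true: list[bool], y_pred: list[bool], positive: bool) -> tuple[float, float]:
    """Precision and recall when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def report(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Compute the table's columns (except Time) for boolean duplicate labels."""
    p_dup, r_dup = precision_recall(y_true, y_pred, positive=True)
    p_non, r_non = precision_recall(y_true, y_pred, positive=False)
    return {
        "precision_duplicates": p_dup,
        "recall_duplicates": r_dup,
        "precision_non_duplicates": p_non,
        "recall_non_duplicates": r_non,
        "macro_f1": (f1_score(p_dup, r_dup) + f1_score(p_non, r_non)) / 2,
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
    }


if __name__ == "__main__":
    print(report([True, True, False, False], [True, False, False, False]))
```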

NEWS-COPY

See tests/test_benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

| Model/Algorithm | ARI |
|---|---|
| n-gram [3] | 0.440 |
| SimHash | 0.612 |
| SimHash [2] | 0.695 |
| MinHash | 0.742 |
| MinHash [3] | 0.737 |
| MinHash [2] | 0.783 |
| Multilingual USE [2] | 0.730 |
| Multilingual E5-Base [2] | 0.742 |
| S-BERT [3] | 0.700 |
| RETSim Partial-Dup [2] | 0.831 |
| RETSim Near-Dup [2] | 0.704 |
| Re-ranking [3] | 0.937 |
| Bi-encoder [3] | 0.915 |
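
The Adjusted Rand Index compares a predicted clustering with the ground-truth clustering, corrected for chance agreement (1.0 means identical partitions, values near 0 mean chance-level agreement). The standard pair-counting formula can be sketched as follows; this is a plain-Python illustration with a hypothetical function name, and the benchmark itself may rely on a library implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """Pair-counting ARI between two cluster assignments of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs placed together in both clusterings (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item is its own cluster in both
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Cluster label values do not matter, only which items are grouped together.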

Note

  1. Best SimHash result from benchmarks/hyperparameter.ipynb
  2. Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}

Footnotes

  1. (Gyawali et al., LREC 2020)

  2. RETSim: Resilient and Efficient Text Similarity

  3. Noise-Robust De-Duplication at Scale
