Detecting near-duplicated text documents

This is an experiment in detecting near-duplicate text documents using MinHash with locality-sensitive hashing (LSH).
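
For readers new to the technique, here is a minimal, standard-library-only sketch of MinHash signatures combined with LSH banding. It is not the code in the dedup package; the function names (shingles, minhash_signature, candidate_pairs) and the parameter values (3-word shingles, 64 hash functions, 16 bands) are assumptions chosen purely for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    # Texts shorter than k words yield a single shingle holding all words.
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """Signature = minimum hash value of the token set under each seeded hash."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def candidate_pairs(signatures, bands=16):
    """Split each signature into bands; documents sharing any band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

To use it, build a dict mapping document ids to minhash_signature(shingles(text)) and pass it to candidate_pairs. The fraction of positions where two signatures agree estimates the Jaccard similarity of the two shingle sets, and with b bands of r rows each a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b, so more bands with fewer rows catches less similar pairs at the cost of more false positives.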

Setup

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Download ItemInfo_train.csv.zip and ItemPairs_train.csv.zip from the Avito Duplicate Ads Detection Kaggle competition into the data/duplicate-ads subdirectory, then run the analysis and plotting modules:

python -m dedup.analyzedata
python -m dedup.plot_profile
python -m dedup.plot_histograms

The output will be saved in a subdirectory called results.

Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets, Chapter 3
