This is an experiment in detecting near-duplicate text documents using the MinHash LSH (locality-sensitive hashing) algorithm.
- Prepare a Python environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Download ItemInfo_train.csv.zip and ItemPairs_train.csv.zip from the Avito Duplicate Ads Detection Kaggle competition into the data/duplicate-ads subdirectory.
- Run the analysis and plotting modules:
python -m dedup.analyzedata
python -m dedup.plot_profile
python -m dedup.plot_histograms
The output will be saved in a subdirectory called results.
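For readers new to the technique, the following is a minimal, standard-library-only sketch of MinHash signatures combined with LSH banding. It is not the code in the dedup package. The function names, the parameters (3-word shingles, 64 hash functions, 16 bands) and the three sample ads are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-shingles of a text (k=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """Build a signature: the minimum of each salted hash stands in for one random permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents that agree on any whole band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical ads, not taken from the Kaggle data.
docs = {
    "a": "selling a red mountain bike in good condition barely used",
    "b": "selling a red mountain bike in good condition hardly used",
    "c": "two bedroom apartment for rent near the city centre",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
for x, y in sorted(pairs):
    print(x, y, estimated_jaccard(signatures[x], signatures[y]))
```

Using more bands with fewer rows each makes the scheme more likely to flag a pair of a given similarity, at the cost of more false candidates. The near-duplicate ads "a" and "b" are likely, but not guaranteed, to be reported as a candidate pair. The unrelated ad "c" shares no shingles with them and should not be.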
Reference: Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Chapter 3 (Finding Similar Items).