aajanki/minhash-dedup

Detecting near-duplicated text documents

This is an experiment in detecting near-duplicate text documents using the MinHash LSH (locality-sensitive hashing) algorithm.
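For readers new to the technique, here is a minimal sketch of the general MinHash LSH idea using only the Python standard library. It is not the code in this repository: the function names, the word 3-shingles, the 128 hash functions, and the 32-band split are all assumptions chosen for illustration. Each document becomes a set of shingles, the set is compressed into a MinHash signature, and signatures are split into bands so that only documents sharing at least one identical band are compared.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def shingles(text, k=3):
    """Return the set of word k-shingles of a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def stable_hash(s):
    """Deterministic 64-bit hash; the built-in hash() is salted per process."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_perm, seed=0):
    """Draw (a, b) pairs for the hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_perm)]


def minhash_signature(shingle_set, params):
    """For every hash function, keep the minimum hash over all shingles."""
    hashed = [stable_hash(s) % PRIME for s in shingle_set]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands):
    """Return pairs of document ids whose signatures agree on a whole band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


# Toy documents, invented for this example (not taken from the dataset).
docs = {
    "a": "selling a red mountain bike in good condition barely used "
         "with new tires and a helmet included",
    "b": "selling a red mountain bike in good condition barely used "
         "with new tires and a helmet attached",
    "c": "apartment for rent near the city centre two bedrooms and a balcony",
}
params = make_hash_params(128, seed=42)
signatures = {d: minhash_signature(shingles(t), params) for d, t in docs.items()}
print(estimate_jaccard(signatures["a"], signatures["b"]))
print(lsh_candidate_pairs(signatures, bands=32))
```

With more bands and fewer rows per band, pairs with lower similarity become candidates; with fewer bands and more rows, only very similar pairs collide. Candidate pairs would normally be verified afterwards with an exact similarity computation.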

Setup

  1. Prepare a Python environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Download ItemInfo_train.csv.zip and ItemPairs_train.csv.zip from the Avito Duplicate Ads Detection Kaggle competition into the data/duplicate-ads subdirectory.

Run the experiments

python -m dedup.analyzedata
python -m dedup.plot_profile
python -m dedup.plot_histograms

The output will be saved in a subdirectory called results.

References

Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets, Chapter 3
