This Python script is designed to compare the text content of two PDF files and determine their similarity using the Jaccard index.
- Python 3.6 or higher
- PyPDF2
- NLTK
- Place the two PDF files that you want to compare in the same directory as the Python script.
- Name the PDFs `1.pdf` and `2.pdf`.
- Open a terminal window and navigate to the directory containing the script and PDF files.
- Run the script using the following command:
python main.py
- The script will output the Jaccard similarity index as a percentage.
- It's recommended to uncomment lines 9 and 10 of the script on the first run.
- The script assumes that the text content of the PDF files is in English.
- The script removes stop words and non-alphanumeric characters from the PDF text before performing the similarity check.
- The script does not take into account formatting, layout, or images in the PDF files.
- The script assumes that you've named your PDFs `1.pdf` and `2.pdf`.
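
For readers unfamiliar with the metric, the Jaccard index of two word sets A and B is |A ∩ B| / |A ∪ B|, the share of distinct words that both documents have in common. The sketch below illustrates the cleaning and comparison steps described in the notes above on two plain strings. It is not the code in `main.py`: the function names `tokenize` and `jaccard_percent` are hypothetical, and the small `STOP_WORDS` set is a stand-in for NLTK's English stop-word list so that the example runs without any downloads or PDF parsing.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list,
# used here only so the sketch has no external dependencies.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text, keep alphanumeric runs, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def jaccard_percent(text_a, text_b):
    """Return the Jaccard index of the two texts' word sets as a percentage."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    union = words_a | words_b
    if not union:
        # Assumption: two texts with no usable words are reported as 0% similar.
        return 0.0
    return 100.0 * len(words_a & words_b) / len(union)


if __name__ == "__main__":
    print(jaccard_percent("The cat sat on the mat.", "A cat sat in the hat!"))
```

Because the comparison works on sets, repeated words and word order have no effect on the result; only the presence or absence of each distinct word matters.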