Plagiarism Detector

I have performed binary classification that labels each file as plagiarized or not.

  • Containment and Longest Common Subsequence are used as similarity features to measure how similar two files are.

  • Containment counts the n-grams that two files have in common, normalized by the number of n-grams in the file being checked. It is calculated as follows:

$$\frac{\sum{\text{count}(\text{ngram}_{A}) \cap \text{count}(\text{ngram}_{S})}}{\sum{\text{count}(\text{ngram}_{A})}}$$
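The containment formula can be sketched in a few lines of standard-library Python. This is a minimal illustration, not necessarily the code in this repository: it assumes that A is the file being checked and S is the source text, that n-grams are word-level, and that tokenization is a plain lowercase whitespace split. The names `ngram_counts` and `containment` are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word-level n-grams (assumed tokenization: lowercase + whitespace split)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer_text, source_text, n=1):
    """Shared n-gram count divided by the n-gram count of the answer text."""
    counts_a = ngram_counts(answer_text, n)
    counts_s = ngram_counts(source_text, n)
    total_a = sum(counts_a.values())
    if total_a == 0:
        # No n-grams in the answer (empty or shorter than n words).
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((counts_a & counts_s).values())
    return overlap / total_a


if __name__ == "__main__":
    print(containment("this is an answer text", "this is a source text", n=1))
    print(containment("this is an answer text", "this is a source text", n=2))
```

Larger values of `n` reward longer runs of copied wording, so computing containment for several values of `n` yields several features per file pair.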

  • The longest common subsequence can be computed using dynamic programming.

  • Correlated features are removed, and a neural network is then trained on the remaining features; it reaches 96% accuracy.
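The dynamic-programming computation of the longest common subsequence mentioned above can be sketched as follows. This is an illustrative version under stated assumptions: sequences are compared word by word, and the feature is normalized by the number of words in the file being checked. The normalization and the names `lcs_length` and `lcs_norm_word` are assumptions, not confirmed details of this repository.

```python
def lcs_length(a_words, s_words):
    """Length of the longest common subsequence of two sequences."""
    rows, cols = len(a_words), len(s_words)
    # table[i][j] holds the LCS length of a_words[:i] and s_words[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a_words[i - 1] == s_words[j - 1]:
                # Matching elements extend the LCS of both prefixes by one.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better result from dropping one element.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def lcs_norm_word(answer_text, source_text):
    """LCS length in words, divided by the answer's word count (assumed normalization)."""
    a_words = answer_text.split()
    s_words = source_text.split()
    if not a_words:
        return 0.0
    return lcs_length(a_words, s_words) / len(a_words)


if __name__ == "__main__":
    print(lcs_norm_word("a b c d", "a x c y"))
```

The table has one cell per pair of prefixes, so time and memory both grow with the product of the two word counts.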

You can find the notebook, Python code, and unit tests in this repository.