Locally run code similarity detection for Python files and Jupyter Notebooks
Runs pairwise comparisons between Python code from .py and .ipynb files to determine:
- Code similarity: Compares code contents after substituting variable names and running other normalizations. Uses a windowed, fingerprinting approach to come up with vector representations for the submissions, which are then compared using cosine similarity. Based primarily on https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf.
- Output similarity: Compares similarity between .ipynb output cells (again using the cosine similarity metric)
- Variable similarity: Compares similarity between all variable names, including function and class names, using Jaccard similarity calculation.
- Structure similarity: Compares the abstract syntax trees (ASTs) for different submissions using Jaccard similarity.
Example output:
{
"repoA": "https://github.com/cs544-wisc/project-6-repo-A",
"repoB": "https://github.com/cs544-wisc/project-6-repo-B",
"code": 0.0258,
"variables": 0.357,
"output": 0.9634,
"struture": 0.7879,
"visualize": "python3 display.py 'project-6-repo-A' 'project-6-repo-B'"
}
Note: The code within the files must be syntactically valid since CoSi uses AST parsing to extract code structure.
- Add a new directory (e.g.
project-10
) within./files/submissions
containing individual student submission directories - Update the
.env
file; setSUBMISSIONS_ROOT_DIR
as the folder name (project-10/
) andFILES_IN_SUBMISSION
as the list of files to be read from each sub-directory ofSUBMISSIONS_ROOT_DIR
e.g (p10.py
ornb/client.py,nb/server.py
). Also, set theGITHUB_PREFIX
as the URL prefix to add to the directory name (used in the resulting CSV) or leave blank if not applicable - Run
python3 CoSi.py
- The result is saved as a CSV in
./files/results
identified by the time of generation. AFAILED
csv is also included if parsing of any files fails
The result csv contains a column to view differences between the code in two submissions. In general, to visualize the text-diff of two submissions, set the SUBMISSIONS_ROOT_DIR
and FILES_IN_SUBMISSION
fields in .env
. Then execute python3 display.py 'repoA' 'repoB'
, where repoA
and repoB
are student directories within the SUBMISSIONS_ROOT_DIR
. This will open a browser window with a side-by-side comparison of all files from FILES_IN_SUBMISSION
in the two repos.
Note: If repoA
and repoB
are present in SUBMISSIONS_ROOT_DIR
, the saved version of the files will be displayed. If any of the repo names passed is not saved locally, it will be cloned from git into SUBMISSIONS_ROOT_DIR
first.
Note: The script uses diffcheck.com public APIs to retreive the diff HTML (https://api.diffchecker.com/public/text). Use of diffcheck API must abide by their API terms.
An example submission is provided for which CoSi may be executed immediately.
- To run CoSi for the example, execute
python3 CoSi.py
- To view the results, see the csv file generated in
results/
- To visualize the diff between the example submission pair, run
python3 display.py 'project-6-repo-B' 'project-6-repo-A'