GitHub - kreeedit/trace: TRACE - Text Reuse Analysis and Comparison Engine

TRACE - Text Reuse Analysis and Comparison Engine

TRACE is a simple Python script that compares the similarities between different text files (even when they are multilingual or paraphrased) using two methods: MinHash and SentenceTransformer. It allows you to specify the directory containing the text (txt) files, the size of the text windows to compare, the step size to move for each new window, and the similarity threshold. It also creates a network graph (text_similarity_network.gexf) of the text similarities to see the relations of the different texts. The result of the analytics is stored in a json file (result.json)

USAGE

The script takes the following arguments:

model (required): The method to use to compare the texts. Must be either "minihash" or "sentencetransformer".
model_type: If you're using the sentencetransformer (default: paraphrase-multilingual-MiniLM-L12-v2).
directory (required): The directory containing the text files to compare.
window_size: The size of each text window. Defaults to 100. Increasing the window size increases the size of the text chunks being compared. A smaller window size may lead to more granular comparison, while a larger window size might lead to more generalized comparison.
step_size: The step size to move for each new window. Defaults to 50. A smaller step size means more overlap between consecutive windows and more comparisons, leading to potentially higher precision but slower computation. A larger step size means less overlap and fewer comparisons.
ngram_size: The size of each n-gram, if you're using the MinHash method. Defaults to 3. Larger n-gram sizes may capture more context but might be less sensitive to small changes, while smaller n-gram sizes might be more sensitive but less context-aware.
similarity_threshold: The threshold for considering two texts to be similar. Defaults to 0.7. Increasing the threshold will result in fewer but potentially higher quality similarities

Minihash

python trace.py --model minihash --directory texts/ --ngram_size 5 --similarity_threshold 0.8

result.json (sample)

    {
        "minhash_similarity": 0.9921875,
        "text1_filename": "text06.txt",
        "text1_text": " The perception of inherent tensions between justice and injustice (or the disproportion of good and",
        "text1_window_start": 0,
        "text2_filename": "text08.txt",
        "text2_text": "The perception of inherent tensions between justice and injustice (or the disproportion of good and ",
        "text2_window_start": 0
    },

Transformer (multilingual paraphrase)

python trace.py --model sentencetransformer --directory texts/ --window_size 100 --step_size 100 --similarity_threshold=0.8

result.json (sample)

 {
        "similarity": 0.8367198705673218,
        "text1_filename": "text01.txt",
        "text1_text": "y were perfectly normal, \nthank you very much. They were the last people you’d \nexpect to be involve",
        "text1_window_start": 100,
        "text2_filename": "text02.txt",
        "text2_text": "ass sie völlig normal waren, \nVielen Dank. Sie waren die letzten Menschen, von denen man \nin etwas S",
        "text2_window_start": 100
    },
{
        "similarity": 0.835121750831604,
        "text1_filename": "text01.txt",
        "text1_text": "ve Nummer vier,\n\nThe Dursleys had everything they wanted, but they \nalso had a secret, and their gre",
        "text1_window_start": 800,
        "text2_filename": "text03.txt",
        "text2_text": " Dursley tenían todo lo que deseaban, pero también un secreto que temían que se descubriera. Creían ",
        "text2_window_start": 1000
    },

Different SentenceTransformer models may produce different results depending on the specific characteristics of the text. Some models may work better with certain languages or domains. For pre-trained models, see S-BERT page and Huggingface. Offered alternative models are: Language-Agnostic BERT Sentence Embedding, all-MiniLM-L6-v2

Network graph

The generated GEXF file can be opened by a variety of graph visualization app, such as Gephi, Cytoscape, and Orange. These applications allow you to view the network data as a graph, and to explore the relationships between the nodes and edges. However, the script also generates a preliminary visualization picture (text_similarity_network.png)

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
daffy-trace.gif		daffy-trace.gif
requirements.txt		requirements.txt
trace.py		trace.py
visualization.png		visualization.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TRACE - Text Reuse Analysis and Comparison Engine

USAGE

Minihash

result.json (sample)

Transformer (multilingual paraphrase)

result.json (sample)

Network graph

About

Releases 2

Packages

Languages

License

kreeedit/trace

Folders and files

Latest commit

History

Repository files navigation

TRACE - Text Reuse Analysis and Comparison Engine

USAGE

Minihash

result.json (sample)

Transformer (multilingual paraphrase)

result.json (sample)

Network graph

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages