`code-similarity`

Note

This is just a POC implementation! Sometimes textual similarity is 100% dispite having differences this seems to be an issue related to simhash-ing. One alternative is to use the Vector directly for calculating similarity.

This is a POC implementation of the paper Code Similarity Detection Using AST and Textual Information.
Do note that this deviates a little from the original paper to be a little efficient for on demand computation of similarity.

This stemmed from the need to efficiently find the difference ratio in un-bundled modules from something like a webpack chunk of the current and previous iteration of said chunk for the sake of datamining™.

The implementation works based on xxhash (xxh3_128) and a specialized simhash with a bit of AST parsing. Similar to the original paper the final difference is a weighted calculation between the textual and AST similarity.

The following pseudo-code outlines how the similarity is calculated

algorithm simhash_vector is
      S := lenght of word
      H := xxh3_128_with_seed(word, S)

      for R in 0 to 128
          B := (H >> R) & 1

          if B == 1
              saturatingly increment vector[R] by 1
          else
              saturatingly decrement vector[R] by 1

algorithm simhash_hash is
    H := 0u128

    for shift in 0 to 128
        if vector[shift] > 0
            H |= 1 << shift

    return H

algorithm recurse_ast is
    simhash_vector(vector, get_label(node))

    for each child in node
        recurse_ast(vector, child)

algorithm similarity_params is
    Tᵥ := [0; 128]

    for each whitespace separated word in source
        simhash_vector(Tᵥ, word)

    Tₕ := simhash_hash(Tᵥ)

    Aᵥ := [0; 128]

    root_node := parse_source_ast(source)
    recurse_ast(Aᵥ, root_node)

    Aₕ := simhash_hash(Aᵥ)

    return (Tₕ, Aₕ)

algorithm hamming_similarity is
    H := count_ones(a ^ b)
    S := 1.0 - (H / type_bits_size(H))

    return S

algorithm similarity is
    Wₐ := clamp(weight, 0.0, 1.0)
    Wₜ := 1.0 - Wₐ

    Sₜ := hamming_similarity(a.text, b.text)
    Sₐ := hamming_similarity(a.ast, b.ast)

    S := (Sₜ * Wₜ) + (Sₐ * Wₐ)

    return S

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
scripts		scripts
src		src
test		test
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`code-similarity`

About

Releases

Packages

Languages

arHSM/code-similarity

Folders and files

Latest commit

History

Repository files navigation

code-similarity

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`code-similarity`

Packages