Skip to content

arHSM/code-similarity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

code-similarity

Note

This is just a POC implementation! Sometimes textual similarity is 100% dispite having differences this seems to be an issue related to simhash-ing. One alternative is to use the Vector directly for calculating similarity.

This is a POC implementation of the paper Code Similarity Detection Using AST and Textual Information.
Do note that this deviates a little from the original paper to be a little efficient for on demand computation of similarity.

This stemmed from the need to efficiently find the difference ratio in un-bundled modules from something like a webpack chunk of the current and previous iteration of said chunk for the sake of datamining™.

The implementation works based on xxhash (xxh3_128) and a specialized simhash with a bit of AST parsing. Similar to the original paper the final difference is a weighted calculation between the textual and AST similarity.

The following pseudo-code outlines how the similarity is calculated

algorithm simhash_vector is
      S := lenght of word
      H := xxh3_128_with_seed(word, S)

      for R in 0 to 128
          B := (H >> R) & 1

          if B == 1
              saturatingly increment vector[R] by 1
          else
              saturatingly decrement vector[R] by 1

algorithm simhash_hash is
    H := 0u128

    for shift in 0 to 128
        if vector[shift] > 0
            H |= 1 << shift

    return H

algorithm recurse_ast is
    simhash_vector(vector, get_label(node))

    for each child in node
        recurse_ast(vector, child)

algorithm similarity_params is
    Tᵥ := [0; 128]

    for each whitespace separated word in source
        simhash_vector(Tᵥ, word)

    Tₕ := simhash_hash(Tᵥ)

    Aᵥ := [0; 128]

    root_node := parse_source_ast(source)
    recurse_ast(Aᵥ, root_node)

    Aₕ := simhash_hash(Aᵥ)

    return (Tₕ, Aₕ)

algorithm hamming_similarity is
    H := count_ones(a ^ b)
    S := 1.0 - (H / type_bits_size(H))

    return S

algorithm similarity is
    Wₐ := clamp(weight, 0.0, 1.0)
    Wₜ := 1.0 - Wₐ

    Sₜ := hamming_similarity(a.text, b.text)
    Sₐ := hamming_similarity(a.ast, b.ast)

    S := (Sₜ * Wₜ) + (Sₐ * Wₐ)

    return S