SetSketch: Filling the Gap between MinHash and HyperLogLog

YuqiZ2020/set-sketch-test

 
 

Adapting SetSketch for alignment-free sequence comparison

This repository adapts the SetSketch algorithm to alignment-free genomic sequence comparison, following the specifications set by the AFproject.
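
To make the goal concrete: alignment-free methods typically compare the sets of k-mers (length-k substrings) of two sequences instead of aligning them. The following is a minimal sketch, not code from this repository, that computes the exact Jaccard distance between two k-mer sets. A sketch such as SetSketch approximates this kind of set similarity using far less memory. The function names `kmer_set` and `jaccard_distance` are hypothetical, and it is an assumption that the benchmark reports a distance of this form.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Collect the distinct k-mers (length-k substrings) of a sequence.
// Hypothetical helper for illustration only.
std::unordered_set<std::string> kmer_set(const std::string& seq, std::size_t k) {
    std::unordered_set<std::string> kmers;
    if (k == 0 || seq.size() < k) return kmers;  // no k-mers to extract
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        kmers.insert(seq.substr(i, k));
    }
    return kmers;
}

// Exact Jaccard distance: 1 - |A intersect B| / |A union B|.
// Two empty sets are treated as identical (distance 0).
double jaccard_distance(const std::unordered_set<std::string>& a,
                        const std::unordered_set<std::string>& b) {
    if (a.empty() && b.empty()) return 0.0;
    // Iterate over the smaller set and look up in the larger one.
    const auto& smaller = a.size() <= b.size() ? a : b;
    const auto& larger = a.size() <= b.size() ? b : a;
    std::size_t shared = 0;
    for (const auto& kmer : smaller) shared += larger.count(kmer);
    const std::size_t union_size = a.size() + b.size() - shared;
    return 1.0 - static_cast<double>(shared) / static_cast<double>(union_size);
}
```

For example, `jaccard_distance(kmer_set("ACGTACGT", 4), kmer_set("ACGTTTTT", 4))` gives 6/7, because the two sequences share one of their seven distinct 4-mers.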

How to run

  1. Clone repository including submodules:

    git clone --recursive https://github.com/YuqiZ2020/set-sketch-test.git
    
  2. Switch to the set-sketch-test directory:

    cd set-sketch-test
    
  3. Place your FASTA files in the test folder, using the following structure:

    src\
       c++\
       fasta_utils.hpp
       set-sketch-4-af-benchmark.cpp
    test\
       A.fasta
       B.fasta
       ...
    
  4. Build the benchmark executable:

    g++ -O3 -std=c++17 -fopenmp -Wall src/set-sketch-4-af-benchmark.cpp -o benchmark.out
    
  5. Run the benchmark to generate all-pairs distances between the FASTA sequences in the test folder:

    ./benchmark.out test/ output.txt 
    
  6. To change the parameters, build and run the parameterized version:

    g++ -O3 -std=c++17 -fopenmp -Wall src/set-sketch-4-af-w-param.cpp -o benchmark-param.out
    ./benchmark-param.out <data_folder>/ output_file.txt <num-register> <base> <a> <q> <k>
    

    For example:

    ./benchmark-param.out test/ output.txt 12 1.5 20 62 11
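
The five trailing arguments in step 6 can be read as the SetSketch parameters plus a k-mer length. The following is a minimal sketch of how such arguments might be parsed and sanity-checked; it is not the code in this repository. The struct `SketchParams` and the functions `parse_params` and `register_bits` are hypothetical. It assumes that `<base>`, `<a>`, and `<q>` correspond to the parameters b > 1, a > 0, and q of the SetSketch paper (registers take values in 0..q+1), and that `<k>` is the k-mer length. Whether `<num-register>` is the register count itself or an exponent is not stated here, so the value is stored unchanged.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical container for the five command-line parameters.
struct SketchParams {
    std::uint64_t num_register;  // stored as given; interpretation is an assumption
    double base;                 // b, must be > 1
    double a;                    // a, must be > 0
    std::uint64_t q;             // registers hold values in {0, ..., q + 1}
    std::uint64_t k;             // assumed to be the k-mer length
};

// Parse the five arguments from their string form and reject invalid values.
SketchParams parse_params(const std::string& num_register, const std::string& base,
                          const std::string& a, const std::string& q,
                          const std::string& k) {
    SketchParams p{std::stoull(num_register), std::stod(base), std::stod(a),
                   std::stoull(q), std::stoull(k)};
    if (p.num_register == 0) throw std::invalid_argument("num-register must be positive");
    if (!(p.base > 1.0)) throw std::invalid_argument("base must be greater than 1");
    if (!(p.a > 0.0)) throw std::invalid_argument("a must be positive");
    if (p.k == 0) throw std::invalid_argument("k must be positive");
    return p;
}

// Number of bits needed to store the largest register value, q + 1.
unsigned register_bits(std::uint64_t q) {
    unsigned bits = 0;
    for (std::uint64_t max_value = q + 1; max_value > 0; max_value >>= 1) ++bits;
    return bits;
}
```

With the example values above (`12 1.5 20 62 11`), `register_bits(62)` is 6, so under these assumptions each register fits in 6 bits.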
    

MATLAB GUI for SetSketch

We implemented a simpler version of SetSketch in MATLAB and provide a GUI for it in the folder MATLAB_GUI. A detailed README for the GUI can be found in that folder.

Our Contribution

  1. src/fasta_utils.hpp was written by Feiyang

  2. src/set-sketch-4-af-benchmark.cpp and src/set-sketch-4-af-2-param.cpp were written by Yuqi and Feiyang

  3. MATLAB_GUI/* was written by Wenxuan

  4. The code in the cwd for hyperparameter tuning was written by Trisha, Yuqi, and Wenxuan

  5. The remaining code is from https://github.com/dynatrace-research/set-sketch-paper.git

SetSketch

The original algorithm by Otmar Ertl was presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog", which was accepted at VLDB 2021. An extended version of the paper that includes mathematical proofs and additional results is available on arXiv. The author's original C++ implementation and additional Python visualization tools are available on GitHub.
