This repository adapts the SetSketch algorithm to alignment-free genomic sequence comparison, following the specifications set by the AFproject.
-
Clone the repository, including submodules:
git clone --recursive https://github.com/YuqiZ2020/set-sketch-test.git
-
Switch to the set-sketch-test directory:
cd set-sketch-test
-
Place your FASTA files in the test folder, using the following structure:
src\
    c++\
    fasta_utils.hpp
    set-sketch-4-af-benchmark.cpp
test\
    A.fasta
    B.fasta
    ...
-
Build the benchmark executable:
g++ -O3 -std=c++17 -fopenmp -Wall src/set-sketch-4-af-benchmark.cpp -o benchmark.out
-
Run the benchmark to generate all-pairs distances between the FASTA sequences in the test folder:
./benchmark.out test/ output.txt
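To make "all-pairs distance" concrete, here is a minimal sketch that computes an exact Jaccard distance between the k-mer sets of every pair of sequences. It is an illustration only: the benchmark estimates distances from SetSketch registers rather than from exact sets, and the function names `kmer_set`, `jaccard_distance`, and `all_pairs` are hypothetical, not taken from this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collect the distinct k-mers (length-k substrings) of a sequence.
std::unordered_set<std::string> kmer_set(const std::string& seq, std::size_t k) {
    assert(k > 0);
    std::unordered_set<std::string> kmers;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        kmers.insert(seq.substr(i, k));
    }
    return kmers;
}

// Exact Jaccard distance: 1 - |intersection| / |union|.
// Two empty sets are treated as identical (distance 0).
double jaccard_distance(const std::unordered_set<std::string>& a,
                        const std::unordered_set<std::string>& b) {
    if (a.empty() && b.empty()) return 0.0;
    std::size_t shared = 0;
    for (const auto& x : a) shared += b.count(x);
    const std::size_t uni = a.size() + b.size() - shared;
    return 1.0 - static_cast<double>(shared) / static_cast<double>(uni);
}

// Symmetric all-pairs distance matrix; the diagonal stays 0.
std::vector<std::vector<double>> all_pairs(const std::vector<std::string>& seqs,
                                           std::size_t k) {
    std::vector<std::unordered_set<std::string>> sets;
    for (const auto& s : seqs) sets.push_back(kmer_set(s, k));
    const std::size_t n = seqs.size();
    std::vector<std::vector<double>> d(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            d[i][j] = d[j][i] = jaccard_distance(sets[i], sets[j]);
        }
    }
    return d;
}
```

Calling `all_pairs({"ACGT", "CGTA"}, 2)` compares the sets {AC, CG, GT} and {CG, GT, TA}, which share two of four distinct k-mers, so the off-diagonal entries are 0.5.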
-
To change the sketch parameters, build and run the parameterized version:
g++ -O3 -std=c++17 -fopenmp -Wall src/set-sketch-4-af-w-param.cpp -o benchmark-param.out
./benchmark-param.out <data_folder>/ output_file.txt <num-register> <base> <a> <q> <k>
For example:
./benchmark-param.out test/ output.txt 12 1.5 20 62 11
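As a rough guide to what these parameters control, the sketch below follows the register definition from the SetSketch paper: each of the m registers stores the maximum over all inserted elements of floor(1 - log_b h_i(d)), clamped to the range [0, q+1], where h_i(d) is exponentially distributed with rate a. It assumes that `<num-register>`, `<base>`, `<a>`, and `<q>` correspond to m, b, a, and q in that notation, and that `<k>` is the k-mer length; that mapping is not confirmed here. The struct `SetSketchToy` is hypothetical, and seeding `std::mt19937_64` from `std::hash` is a simple stand-in for the hashing used in the reference implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

// Toy illustration of SetSketch registers; not the code in src/.
struct SetSketchToy {
    std::size_t m;          // number of registers
    double b;               // base, must be > 1
    double a;               // rate of the exponential distribution, must be > 0
    int q;                  // registers take values in {0, ..., q + 1}
    std::vector<int> regs;  // register values, all initially 0

    SetSketchToy(std::size_t m_, double b_, double a_, int q_)
        : m(m_), b(b_), a(a_), q(q_), regs(m_, 0) {
        assert(m > 0 && b > 1.0 && a > 0.0 && q >= 0);
    }

    // Insert one element (for example a k-mer). The element seeds a PRNG, so
    // the same element always yields the same m pseudo-random values.
    void add(const std::string& element) {
        std::mt19937_64 rng(std::hash<std::string>{}(element));
        std::exponential_distribution<double> expo(a);
        for (std::size_t i = 0; i < m; ++i) {
            const double h = expo(rng);
            const double raw = std::floor(1.0 - std::log(h) / std::log(b));
            // Clamp in floating point first so an infinite value is handled safely.
            const double clamped =
                std::min(static_cast<double>(q + 1), std::max(0.0, raw));
            regs[i] = std::max(regs[i], static_cast<int>(clamped));
        }
    }

    // The sketch of a union is the element-wise maximum of the two sketches.
    void merge(const SetSketchToy& other) {
        assert(other.regs.size() == regs.size());
        for (std::size_t i = 0; i < regs.size(); ++i) {
            regs[i] = std::max(regs[i], other.regs[i]);
        }
    }
};
```

Because every update is a maximum, inserting the same k-mer twice leaves the registers unchanged, and merging two sketches gives the same registers as sketching the union of their inputs. A larger b makes each register coarser, while a and q together determine the range of set sizes the registers can represent before saturating at 0 or q + 1.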
We also implement a simpler version of SetSketch in MATLAB and provide a GUI in the folder MATLAB_GUI. A detailed README for the GUI can be found in that folder.
-
src/fasta_utils.hpp was written by Feiyang
-
src/set-sketch-4-af-benchmark.cpp and src/set-sketch-4-af-2-param.cpp were written by Yuqi and Feiyang
-
MATLAB_GUI/* was written by Wenxuan
-
The hyperparameter tuning code in the top-level directory was written by Trisha, Yuqi, and Wenxuan
-
The remaining code is from https://github.com/dynatrace-research/set-sketch-paper.git
The original algorithm by Otmar Ertl was presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog", which was accepted at VLDB 2021. An extended version of the paper that includes mathematical proofs and additional results is available on arXiv. The author's original C++ implementation and additional Python visualization tools are available on GitHub.