Skip to content

Scalable graph based indices for approximate nearest neighbor search

License

Notifications You must be signed in to change notification settings

michaelyhuang23/DiskANN

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DiskANN for DPC

Build:

mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 

Run:

cd build/tests
./blindprobe_dpc --K 50 --L 64 --Lnn 64 --max_degree 64 --Lbuild 64 --density_cutoff 0.001 --dist_cutoff 1300 --Dbrute 1000 --query_file ../../data/mnist.data --output_file ../../results/mnist_blindprobe.out --decision_graph_file ../../results/mnist_blindprobe.graph

--K is the number of k nearest neighbors used for density computation

--L is the L parameter for density computation (should be larger than K)

--Lnn is the L parameter for dependent point finding

--max_degree is the max_degree of the graph

--Lbuild is used for graph construction

--density_cutoff decides which points are noise points

--dist_cutoff decides which non-noise points are cluster centers

--Dbrute decides how many high density points to bruteforce dependent points

--query_file is the txt source file. It should contain $n$ rows, each with $dim$ floats

--output_file stores the clustering results

--decision_graph_file stores the density and dependent point results

Files:

/tests/blindprobe_dpc.cpp is the main dpc algorithm. It uses a priority search queue to go through points close to the query point to find the dependent point. The density it computes is an average result of several nearest neighbors, which helps stabilize the result.

/tests/doubling_dpc.cpp is an archaic dpc algorithm. It doubles search range each time, until the dependent point is found. There's a parallelism bug in the diskann library which I believe is related to resizing the scratch. This algorithm triggers that bug.

/tests/bruteforce_dpc.cpp is just the bruteforce dpc algorithm. It computes a simpler version of density compared to blindprobe_dpc. It takes $O(n^2)$ time and linear memory.

/python/plot_decision_graph.py is a helper program that plots decision graph based on the outputed decision_graph_file

python/cluster_eval.py is a helper program that computes various clustering metrics based on ground truth and the output clustering file output_file

tip: using diskann::cout can cause parallelism error, so using std::cout may be better

Just nearest neighbor search usage:

Please see the following pages on using the compiled code:

About

Scalable graph based indices for approximate nearest neighbor search

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 94.6%
  • Python 2.6%
  • CMake 2.4%
  • Other 0.4%