If you are interested in the research line of nearest neighbor search or application and looking for a cooperation please feel free to contect me via [email protected]
======
SSG is a graph-based approximate nearest neighbor search (ANNS) algorithm. It provides a flexible and efficient solution for the metric-free large-scale ANNS on dense real vectors. It implements the algorithm of our paper: Satellite System Graph: Towards the Efficiency Up-Boundary of Graph-Based Approximate Nearest Neighbor Search
Data set | Download | dimension | nb base vectors | nb query vectors | original website |
---|---|---|---|---|---|
SIFT1M | original website | 128 | 1,000,000 | 10,000 | original website |
GIST1M | original website | 128 | 1,000,000 | 1,000 | original website |
Crawl | crawl.tar.gz (1.7GB) | 300 | 1,989,995 | 10,000 | original website |
GloVe-100 | glove-100.tar.gz (424MB) | 100 | 1,183,514 | 10,000 | original website |
Deep100M | deep100m.tar.gz* (34GB) | 96 | 100,000,000 | 10,000 | original website |
- For Deep100M we will provide the download link upon request
The performance was tested without parallelism. Among all the graph-based algorithms, NSG and SSG has the smallest index size.
Compared Algorithms:
Please see our NSG paper for the performance of other graph-based algorithms - FANNG.
- Prerequisite : openmp, cmake, boost
- Compile:
- Go to the root directory of faiss, it's under the directory of extern_libraries aside of ours.
- Execute the following commands:
$ cd /path/to/project
$ mkdir -p build && cd build
$ cmake .. && make -j
The main interfaces and classes have its respective test codes under directory tests/
.
Please follow the instructions below to build the SSG index.
Firstly, we need to prepare a kNN graph.
We suggest you use our efanna_graph to build this kNN graph. But you can also use any alternatives you like, such as KGraph or faiss.
For example:
$ cd /path/to/project/build/tests/
$ ./test_ssg_index data_path knn_graph_path L R Angle ssg_path
- data_path is the path of the origin data.
- knn_graph_path is the path of the pre-built kNN graph.
- L controls the quality of the NSG, the larger the better, L > R.
- R controls the index size of the graph, the best R is related to the intrinsic dimension of the dataset.
- Angle controls the angle between two edges.
- ssg_path is the path where the result SSG stored.
For example:
$ cd /path/to/project/build/tests/
$ ./test_ssg_optimized_search data_path query_path ssg_path search_L search_K result_path [random_seed]
- data_path is the path of the origin data.
- query_path is the path of the query data.
- ssg_path is the path of the pre-built SSG.
- search_L controls the quality of the search results, the larger the better but slower (must larger than search_K).
- search_K controls the number of neighbors we want to find.
- result_path is the path of the result neighbors.
- random_seed (optional) is the random seed.
NOTE: For now, we only provide interface for search for only one query at a time, and test the performance with single thread.
NOTE: Data alignment is essential for the correctness of our procedure, because we use SIMD instructions for acceleration of numerical computing such as AVX and SSE2. You should use it to ensure your data elements (feature) is aligned with 8 or 16 int or float. For example, if your features are of dimension 70, then it should be extend to dimension 72. And the last 2 dimension should be filled with 0 to ensure the correctness of the distance computing. And this is what data_align() does.
NOTE: Only data-type int32 and float32 are supported for now.
$ cd /path/to/project/
$ python setup.py install
NOTE: currently Python API only supports
search
method.
import numpy as np
import pyssg
data = np.fromfile("/path/to/sift_base.fvecs", dtype=np.float32)
dim = data[0].view(np.int32)
data = data.reshape(-1, dim + 1)
data = np.ascontiguousarray(data[:, 1:])
ndata, dim = data.shape
pyssg.set_seed(1234)
index = pyssg.IndexSSG(dim, ndata)
index.load("/path/to/ssg", data)
k, l = 100, 300
query = np.random.randn(dim).astype(np.float32)
knn = index.search(query, k, l)
print(knn)
Please refer to tests/test_python_query.py
for a real-world example.
We use the following parameters to get the SSG index in Fig. 6 of our paper.
We use efanna_graph to build the kNN graph
- Tool: efanna_graph
- Parameters:
Dataset | K | L | iter | S | R |
---|---|---|---|---|---|
SIFT1M | 200 | 200 | 12 | 10 | 100 |
GIST1M | 400 | 400 | 12 | 15 | 100 |
Crawl | 400 | 420 | 12 | 15 | 100 |
GloVe-100 | 400 | 420 | 12 | 20 | 200 |
- Commands:
$ efanna_graph/tests/test_nndescent sift.fvecs sift_200nn.knng 200 200 12 10 100
$ efanna_graph/tests/test_nndescent gist.fvecs gist_400nn.knng 400 400 12 15 100
$ efanna_graph/tests/test_nndescent crawl.fvecs crawl_400nn.knng 400 420 12 15 100
$ efanna_graph/tests/test_nndescent glove-100.fvecs glove-100_400nn.knng 400 420 12 15 200
- Parameters:
Dataset | L | R | Angle |
---|---|---|---|
SIFT1M | 100 | 50 | 60 |
GIST1M | 500 | 70 | 60 |
Crawl | 500 | 40 | 60 |
GloVe-100 | 500 | 50 | 60 |
- Commands:
$ ./test_ssg_index sift.fvecs sift_200nn.knng 100 50 60 sift.ssg
$ ./test_ssg_index gist.fvecs gist_400nn.knng 500 70 60 gist.ssg
$ ./test_ssg_index crawl.fvecs crawl_400nn.knng 500 40 60 crawl.ssg
$ ./test_ssg_index glove-100.fvecs glove-100_400nn.knng 500 50 60 glove-100.ssg
search_L
: range fromsearch_K
to 2000random_seed
: 161803398
Here we provide our pre-built kNN graphs and SSG index files used in our papar's experiments.
Dataset | kNN Graph | SSG Index |
---|---|---|
SIFT1M | sift_200nn.knng | sift.ssg |
GIST1M | gist_400nn.knng | gist.ssg |
Crawl | crawl_400nn.knng | crawl.ssg |
GloVe-100 | glove_400nn.knng | glove.ssg |