Skip to content

YujianFu97/NHQ

Repository files navigation

NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint

1. Introduction

Hybrid query processing aims to identify these objects with similar vectors to query object and satisfying the given attribute constraints [SIGMOD'20, VLDB'20, SIGMOD'21, KDD'21]; this pressing need is fueled by massive amount of emerging applications such as visual products search [VLDB'20, SIGMOD'21, Middleware'18], academic expert finding, movie recommendation, etc. Our paper entitled "NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint" provides a Native Hybrid Query (NHQ) solution outperforms the state-of-the-art competitors (up to 315x faster under the same recall) on all experimental datasets.

This repo contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper.

2. Compared methods

We compare our hybrid query methods with six existing ones that have been used in many high-tech companies.

  • ADBV (VLDB'2020) is a cost-based hybrid query method proposed by Alibaba. It implements PQ and linear scan for vector similarity search, thus forming four sub-plans; and the least cost one is selected for answering a hybrid query.
  • Milvus (SIGMOD'2021) adopts a partition-based approach regarding attribute; it divides the object set through frequently used attributes, and deploys ADBV on each subset.
  • Vearch (Middleware'2018) is a repo developed by Jingdong. It first recalls similar candidates with unstructured constraint on HNSW, and then performs attribute filtering on the candidates to yield the final results.
  • NGT (SISAP'2016) is a library for performing high-speed ANNS released by Yahoo Japan. We implemented the hybrid query processing that conducts attribute filtering atop the candidates recalled by NGT.
  • Faiss (IVFPQ, TPAMI'2011) is popular quantization-based vector similarity search library developed by Facebook. We implement its hybrid query processing based on IVFPQ and strategy A (Fig2 in our paper).
  • SPTAG (ACM MM'2012, CVPR'2012, TPAMI'2014) is a PG-based vector similarity search library released by Microsoft, and it answers a hybrid query on strategy B (Fig2 in our paper).
  • NHQ-NPG_nsw and NHQ-NPG_kgraph is our hybrid query methods based on NHQ framework and two NPGs.

3. Datasets

Our experiment involves nine publicly available real-world datasets and one in-house dataset. Among them, the nine public datasets are composed of high-dimensional feature vectors extracted from the unstructured information, and they do not originally contain attributes; at this time, they are used for the performance evaluation of PGs. There is no publicly available dataset thus far that contains both structured and unstructured information [VLDB'20]. Therefore, we generate corresponding set of attributes for each object in public datasets following [SIGMOD'21]. For example, we add attributes such as date, location, size, etc. to each image on SIFT1M to form an object with a feature vector and a set of attributes. For the in-house dataset, each object in it consisting of high-dimensional vector extracted from paper content as well as three structured attributes, i.e., affiliation, topic, publication. The following table summarizes their main information.

object_num feature-vector_dim query_num type download(vector) download (Attributes) download (groundtruth)
SIFT1M 1000000 128 10000 Image+Attribute sift.tar.gz(161MB) sift_attribute.tar.gz label_sift_groundtruth.ivecs
GIST1M 1000000 960 1000 Image+Attribute gist.tar.gz(2.6GB) gist_attribute.tar.gz label_gist_groundtruth.ivecs
GloVe 1183514 100 10000 Text+Attribute glove-100.tar.gz(424MB) glove-100_attribute.tar.gz label_glove_groundtruth.ivecs
Crawl 1989995 300 10000 Text+Attribute crawl.tar.gz(1.7GB) crawl_attribute.tar.gz label_crawl_groundtruth.ivecs
Audio 53387 192 200 Audio+Attribute audio.tar.gz(26MB) audio_attribute.tar.gz label_audio_groundtruth.ivecs
Msong 992272 420 200 Audio+Attribute msong.tar.gz(1.4GB) msong_attribute.tar.gz label_msong_groundtruth.ivecs
Enron 94987 1369 200 Text+Attribute enron.tar.gz(51MB) enron_attribute.tar.gz label_enron_groundtruth.ivecs
UQ-V 1000000 256 10000 Video+Attribute uqv.tar.gz(800MB) uqv_attribute.tar.gz label_uqv_groundtruth.ivecs
Paper 2029997 200 10000 Text+Attribute paper.tar.gz(1.41GB) paper_attribute.tar.gz label_paper_groundtruth.ivecs
BIGANN100M 100000000 128 10000 Image+Attribute bigann100m.tar.gz(9.2GB) bigann100m_attribute.tar.gz label_bigann100m_groundtruth.ivecs

Note that, all original objects and query object are converted to fvecs format, and groundtruth data is converted to ivecs format. Please refer here for the description of fvecs and ivecs format.

4. Parameters

Because parameters' adjustment in the entire object set may cause overfitting, we randomly sample a certain percentage of data points from the original object set to form a validation set. We search for the optimal value of all the adjustable parameters of each algorithm on each validation set, to make the algorithms' search performance reach the optimal level. See the parameters page for more information.

5. Installation and Usage

Prerequistes

GCC 4.9+ wih OpenMP
CMake 2.8+
Boost 1.55+

Installation

You can run the build.sh script to install all algorithms, including NPG_kgraph, NPG_nsw, NHQ-NPG_kgraph and NHQ-NPG_nsw.

Usage

After performing the installation, you can test each algorithm via the script test_hybrid_query.py or test_npg.py, and the specific parameter values can be found in parameters.

Validation of NHQ framework

To verify the capabilities of NHQ, we implement the hybrid queries additionally based on the "first vector similarity search, and then attribute filtering" strategy across NPG_kgraph and NPG_nsw. Thus, for hybrid queries on NHQ, you can test it by setting the following options in the script test_hybrid_query.py:

<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw'	# build and search

For the first vector search then label filtering, you can test it by setting the following options in test_hybrid_query.py:

<index>: 'NPG_kgraph' or 'NPG_nsw'	# build and search

NPG's Performance

To verify the effectiveness of the proposed edge selection and routing strategies, you can evaluate NPG algorithms (i.e., NPG_kgraph and NPG_nsw) by setting the following options in test_npg.py:

<index>: 'NPG_kgraph' or 'NPG_nsw'	# build and search

Evaluation of Hybrid Query Methods

You can test it by setting the following options in the script test_hybrid_query.py:

<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw'	# build and search

Parameter Sensitivity

For NHQ, $\omega _{x}$​ and $\omega _{y}$​ in Equation 5 of our paper are a pair of parameters that regulate the weights of $\delta$​ and $\chi$​, where $\delta$​ is the distance between feature vectors, and $\chi$​ is the distance between attribute vectors. Thus, $\omega _{x}$​ and $\omega _{y}$​ will impact the hybrid query performance. You can set different $\omega _{x}$​ and $\omega _{y}$​​ by modify the following code in /NHQ-NPG_kgraph/src/index_graph.cpp.

  void IndexGraph::fusion_distance(float &dist, float &cnt)
  {
    // dist = cnt * dist * 2 / (cnt + dist); //w_x=cnt/(dist + cnt), w_y=dist/(dist + cnt)
    // cnt *= 100; // w_x=cnt*/(dist + cnt*), w_y=dist/(dist + cnt*)
    // if(cnt == 0) cnt = 1;
    // dist = cnt * dist * 2/(cnt + dist);
    // dist = dist / 521675 + float(cnt) / 3.0;  // w_x=1/dist_max, w_y=1/cnt_max
    dist += dist * cnt / (float)attribute_number_;  //w_x=1, w_y=dist/cnt_max
    // dist += 10000 * cnt; //w_x/w_y=c, and c is a constant
  }

6. Acknowledgements

The implementation of NPG_kgraph is taken from kgraph, ssg, nns_benchmark, and the implementation of NPG_nsw is taken from n2. Many thanks to them for inspiration. Thanks to everyone who provided references for this project.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages