NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint

1. Introduction

Hybrid query processing aims to identify these objects with similar vectors to query object and satisfying the given attribute constraints [SIGMOD'20, VLDB'20, SIGMOD'21, KDD'21]; this pressing need is fueled by massive amount of emerging applications such as visual products search [VLDB'20, SIGMOD'21, Middleware'18], academic expert finding, movie recommendation, etc. Our paper entitled "NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint" provides a Native Hybrid Query (NHQ) solution outperforms the state-of-the-art competitors (up to 315x faster under the same recall) on all experimental datasets.

This repo contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper.

2. Compared methods

We compare our hybrid query methods with six existing ones that have been used in many high-tech companies.

ADBV (VLDB'2020) is a cost-based hybrid query method proposed by Alibaba. It implements PQ and linear scan for vector similarity search, thus forming four sub-plans; and the least cost one is selected for answering a hybrid query.
Milvus (SIGMOD'2021) adopts a partition-based approach regarding attribute; it divides the object set through frequently used attributes, and deploys ADBV on each subset.
Vearch (Middleware'2018) is a repo developed by Jingdong. It first recalls similar candidates with unstructured constraint on HNSW, and then performs attribute filtering on the candidates to yield the final results.
NGT (SISAP'2016) is a library for performing high-speed ANNS released by Yahoo Japan. We implemented the hybrid query processing that conducts attribute filtering atop the candidates recalled by NGT.
Faiss (IVFPQ, TPAMI'2011) is popular quantization-based vector similarity search library developed by Facebook. We implement its hybrid query processing based on IVFPQ and strategy A (Fig2 in our paper).
SPTAG (ACM MM'2012, CVPR'2012, TPAMI'2014) is a PG-based vector similarity search library released by Microsoft, and it answers a hybrid query on strategy B (Fig2 in our paper).
NHQ-NPG_nsw and NHQ-NPG_kgraph is our hybrid query methods based on NHQ framework and two NPGs.

3. Datasets

Our experiment involves nine publicly available real-world datasets and one in-house dataset. Among them, the nine public datasets are composed of high-dimensional feature vectors extracted from the unstructured information, and they do not originally contain attributes; at this time, they are used for the performance evaluation of PGs. There is no publicly available dataset thus far that contains both structured and unstructured information [VLDB'20]. Therefore, we generate corresponding set of attributes for each object in public datasets following [SIGMOD'21]. For example, we add attributes such as date, location, size, etc. to each image on SIFT1M to form an object with a feature vector and a set of attributes. For the in-house dataset, each object in it consisting of high-dimensional vector extracted from paper content as well as three structured attributes, i.e., affiliation, topic, publication. The following table summarizes their main information.

	object_num	feature-vector_dim	query_num	type	download(vector)	download (Attributes)	download (groundtruth)
SIFT1M	1000000	128	10000	Image+Attribute	sift.tar.gz(161MB)	sift_attribute.tar.gz	label_sift_groundtruth.ivecs
GIST1M	1000000	960	1000	Image+Attribute	gist.tar.gz(2.6GB)	gist_attribute.tar.gz	label_gist_groundtruth.ivecs
GloVe	1183514	100	10000	Text+Attribute	glove-100.tar.gz(424MB)	glove-100_attribute.tar.gz	label_glove_groundtruth.ivecs
Crawl	1989995	300	10000	Text+Attribute	crawl.tar.gz(1.7GB)	crawl_attribute.tar.gz	label_crawl_groundtruth.ivecs
Audio	53387	192	200	Audio+Attribute	audio.tar.gz(26MB)	audio_attribute.tar.gz	label_audio_groundtruth.ivecs
Msong	992272	420	200	Audio+Attribute	msong.tar.gz(1.4GB)	msong_attribute.tar.gz	label_msong_groundtruth.ivecs
Enron	94987	1369	200	Text+Attribute	enron.tar.gz(51MB)	enron_attribute.tar.gz	label_enron_groundtruth.ivecs
UQ-V	1000000	256	10000	Video+Attribute	uqv.tar.gz(800MB)	uqv_attribute.tar.gz	label_uqv_groundtruth.ivecs
Paper	2029997	200	10000	Text+Attribute	paper.tar.gz(1.41GB)	paper_attribute.tar.gz	label_paper_groundtruth.ivecs
BIGANN100M	100000000	128	10000	Image+Attribute	bigann100m.tar.gz(9.2GB)	bigann100m_attribute.tar.gz	label_bigann100m_groundtruth.ivecs

Note that, all original objects and query object are converted to fvecs format, and groundtruth data is converted to ivecs format. Please refer here for the description of fvecs and ivecs format.

4. Parameters

Because parameters' adjustment in the entire object set may cause overfitting, we randomly sample a certain percentage of data points from the original object set to form a validation set. We search for the optimal value of all the adjustable parameters of each algorithm on each validation set, to make the algorithms' search performance reach the optimal level. See the parameters page for more information.

5. Installation and Usage

Prerequistes

GCC 4.9+ wih OpenMP
CMake 2.8+
Boost 1.55+

Installation

You can run the build.sh script to install all algorithms, including NPG_kgraph, NPG_nsw, NHQ-NPG_kgraph and NHQ-NPG_nsw.

Usage

After performing the installation, you can test each algorithm via the script test_hybrid_query.py or test_npg.py, and the specific parameter values can be found in parameters.

Validation of NHQ framework

To verify the capabilities of NHQ, we implement the hybrid queries additionally based on the "first vector similarity search, and then attribute filtering" strategy across NPG_kgraph and NPG_nsw. Thus, for hybrid queries on NHQ, you can test it by setting the following options in the script test_hybrid_query.py:

<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw'	# build and search

For the first vector search then label filtering, you can test it by setting the following options in test_hybrid_query.py:

<index>: 'NPG_kgraph' or 'NPG_nsw'	# build and search

NPG's Performance

To verify the effectiveness of the proposed edge selection and routing strategies, you can evaluate NPG algorithms (i.e., NPG_kgraph and NPG_nsw) by setting the following options in test_npg.py:

<index>: 'NPG_kgraph' or 'NPG_nsw'	# build and search

Evaluation of Hybrid Query Methods

You can test it by setting the following options in the script test_hybrid_query.py:

<index>: 'NHQ-NPG_kgraph' or 'NHQ-NPG_nsw'	# build and search

Parameter Sensitivity

For NHQ, $\omega _{x}$ and $\omega _{y}$ in Equation 5 of our paper are a pair of parameters that regulate the weights of $\delta$ and $\chi$, where $\delta$ is the distance between feature vectors, and $\chi$ is the distance between attribute vectors. Thus, $\omega _{x}$ and $\omega _{y}$ will impact the hybrid query performance. You can set different $\omega _{x}$ and $\omega _{y}$ by modify the following code in /NHQ-NPG_kgraph/src/index_graph.cpp.

  void IndexGraph::fusion_distance(float &dist, float &cnt)
  {
    // dist = cnt * dist * 2 / (cnt + dist); //w_x=cnt/(dist + cnt), w_y=dist/(dist + cnt)
    // cnt *= 100; // w_x=cnt*/(dist + cnt*), w_y=dist/(dist + cnt*)
    // if(cnt == 0) cnt = 1;
    // dist = cnt * dist * 2/(cnt + dist);
    // dist = dist / 521675 + float(cnt) / 3.0;  // w_x=1/dist_max, w_y=1/cnt_max
    dist += dist * cnt / (float)attribute_number_;  //w_x=1, w_y=dist/cnt_max
    // dist += 10000 * cnt; //w_x/w_y=c, and c is a constant
  }

6. Acknowledgements

The implementation of NPG_kgraph is taken from kgraph, ssg, nns_benchmark, and the implementation of NPG_nsw is taken from n2. Many thanks to them for inspiration. Thanks to everyone who provided references for this project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint

1. Introduction

2. Compared methods

3. Datasets

4. Parameters

5. Installation and Usage

6. Acknowledgements

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 100 Commits
NHQ-NPG_kgraph		NHQ-NPG_kgraph
NHQ-NPG_nsw		NHQ-NPG_nsw
NPG_kgraph		NPG_kgraph
NPG_nsw/n2		NPG_nsw/n2
parameters		parameters
.gitignore		.gitignore
README.md		README.md
build.sh		build.sh
test_hybrid_query.py		test_hybrid_query.py
test_npg.py		test_npg.py

YujianFu97/NHQ

Folders and files

Latest commit

History

Repository files navigation

NHQ: Native Hybrid Query Framework for Vector Similarity Search with Attribute Constraint

1. Introduction

2. Compared methods

3. Datasets

4. Parameters

5. Installation and Usage

6. Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages