Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Excluding vectors from search #40

Closed
iNDicat0r opened this issue Mar 14, 2017 · 5 comments
Closed

Feature: Excluding vectors from search #40

iNDicat0r opened this issue Mar 14, 2017 · 5 comments
Labels

Comments

@iNDicat0r
Copy link

Imagine a set of 1000 vectors, each vector is linked to an entity(images in my case) and lets say i perform a knn lookup with k = 5, now i would like to examine the scenario of dynamically excluding vectors based on some criteria.

Is this even possible?

@iNDicat0r
Copy link
Author

An example would be:
12 cars, 4 red(2auto, 2 manual), 4 blue(2auto, 2 manual), 4 green (2auto, 2 manual).

My query vector is actually a manual car but i would like to exclude the red ones from search.

@mdouze
Copy link
Contributor

mdouze commented Mar 14, 2017

Hi!

First, I think it is misleading to use semantic analogies. Faiss is not a SQL database engine with symbolic representations like red / manual

There is no easy way of restricting elements from the search. The reason is because Faiss mainly relies on scanning strings of codes and computing distances. During the scan, it accesses the corresponding ID only if the code corresponds to a potentially interesting item. Skipping over some of the codes would require to do some ID-dependent operation, which slows the processing down.

The most sensible thing to do is to query Faiss with a larger k than needed and filter out the irrelevant results post-hoc.

@mdouze mdouze closed this as completed Mar 14, 2017
@jegou
Copy link
Contributor

jegou commented Mar 14, 2017

@iNDicat0r: this is currently not implemented, since this would probably require some specific callback function to test whether an entry satisfies the external constraints, which could be of any kind and therefore out of the scope of a core library.

We already have encountered this problem, and discussed about two possible turn-around strategies:

  • requesting a larger k' (>>k), and then filtering the short-list of size k. As mentioned Matthijs, this should not affect too much the speed if the requested k' << N, since our heap is quite efficient.
  • you can create multiple indexes, each per condition. Obviously this can work only for specific conditions. A typical example for which this works is when you want to filter based on the date (you just need to create an index per day).

@iNDicat0r
Copy link
Author

Gentlemen thanks for the quick responses. I completely understand that we are not talking about b-trees and sql like searches here, however there are scenarios where accuracy might be more important than efficiency. In any case the multiple index approach sounds applicable in my case(doing a classification first and use the classification label as an index???!!)

@qinjian623
Copy link

qinjian623 commented Feb 1, 2021

Same scenario like @iNDicat0r .
We got different attributes from data.
With pre-building multiple indexes, it may save more GPU memo.

Otherwise, we need build [ number of attributes * categories of each attribute ] indexes.

For example,
Car with attributes:

  1. brand = {Toyota, Ford ...}
  2. color = {Blue, White ...}
  3. type = {truck, sedan, suv ...}

Sometimes, we only want all Toyotas, or all blue ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants