Clarify different between dot_product and cosine similarities #91260

Closed
Tracked by #84324
jtibshirani opened this issue Nov 2, 2022 · 4 comments
Labels
>docs General docs changes >enhancement :Search Relevance/Vectors Vector search Team:Docs Meta label for docs team Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jtibshirani
Contributor

jtibshirani commented Nov 2, 2022

Approximate kNN search supports two similarities that behave almost identically:

  • cosine accepts vectors of any magnitude and computes the cosine similarity between them
  • dot_product requires vectors to be of magnitude 1 and computes their dot product, which for unit vectors equals the cosine similarity

Our recommendation is to use dot_product if possible, since it avoids computing the vector magnitudes (they're always 1), making search significantly faster. It's a bit confusing to have two similarities for the same use case -- users often just choose cosine and get suboptimal performance.
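To illustrate why the two similarities agree for unit vectors, here is a minimal pure-Python sketch. The helper names (`dot`, `magnitude`, `cosine`, `normalize`) are hypothetical and for illustration only; this is not how Elasticsearch or Lucene implements the computation.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    # Euclidean (L2) norm.
    return math.sqrt(dot(v, v))

def cosine(a, b):
    # Cosine similarity: needs both magnitudes on every comparison.
    return dot(a, b) / (magnitude(a) * magnitude(b))

def normalize(v):
    # Scale a vector to magnitude 1 (done once, before indexing).
    m = magnitude(v)
    return [x / m for x in v]

a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

# For unit vectors the dot product alone gives the cosine similarity,
# so the per-comparison magnitude work disappears.
assert math.isclose(cosine(a, b), dot(a_unit, b_unit))
```

Normalizing once up front moves the magnitude cost out of the search path, which is the reason for recommending dot_product here.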

Maybe we could update cosine to compute and store the vector magnitudes while indexing. We could also compute the query magnitude once per search. Then, we could just reuse the magnitudes during the similarity computation. We do this for non-indexed dense_vector fields and found it really improved performance (#46294). This would require changes to how Lucene indexes vectors, described here: apache/lucene#11228.
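A rough sketch of the idea, assuming a toy in-memory index: magnitudes are computed once when a vector is added, the query magnitude is computed once per search, and both are reused in the scoring loop. `CosineIndex` is a hypothetical class for illustration and does not reflect how Lucene stores vectors.

```python
import math

class CosineIndex:
    """Toy brute-force index that caches vector magnitudes at index time."""

    def __init__(self):
        self.vectors = []
        self.magnitudes = []

    def add(self, v):
        # Compute and store the magnitude once, while indexing.
        self.vectors.append(v)
        self.magnitudes.append(math.sqrt(sum(x * x for x in v)))

    def search(self, q):
        # Compute the query magnitude once per search.
        q_mag = math.sqrt(sum(x * x for x in q))
        scores = []
        for v, v_mag in zip(self.vectors, self.magnitudes):
            # Only the dot product is computed per comparison;
            # both magnitudes are reused.
            d = sum(x * y for x, y in zip(q, v))
            scores.append(d / (q_mag * v_mag))
        return scores

idx = CosineIndex()
idx.add([3.0, 4.0])
idx.add([4.0, -3.0])
scores = idx.search([6.0, 8.0])
```

With this layout the per-comparison cost is the same as a plain dot product plus one multiply and one divide.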

We could then either remove dot_product, or expand its purpose. (For example, maybe dot_product could accept vectors of any magnitude, which is helpful in recommendation use cases? This would require research.)

elasticsearchmachine added the Team:Search label Nov 2, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani
Contributor Author

This would also help with the byte-sized vectors work (#89784). Currently if you use dot_product with element_type: byte, then we just assume all the vectors have the same magnitude, but don't enforce it. The score is also a bit strange, to account for the fact that the vectors can have any magnitude (it's 0.5 + (dot_product / (32768 * dims))). Everything works out more nicely if you use the cosine similarity.
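As a small sketch of the byte score formula quoted above (the function name `byte_dot_product_score` is hypothetical, and the formula is taken from this comment rather than from the source code), the following shows that two vectors pointing in the same direction get different scores when their magnitudes differ:

```python
def byte_dot_product_score(a, b):
    # Score for byte vectors as described above:
    # 0.5 + dot_product / (32768 * dims)
    dims = len(a)
    d = sum(x * y for x, y in zip(a, b))
    return 0.5 + d / (32768 * dims)

# Same direction, different magnitudes: the scores differ because
# nothing enforces a common magnitude.
small = byte_dot_product_score([1, 0], [1, 0])
large = byte_dot_product_score([100, 0], [100, 0])

# Extremes of the signed byte range keep the score within [0, 1].
top = byte_dot_product_score([-128, -128], [-128, -128])
neutral = byte_dot_product_score([0, 0], [0, 0])
```

Cosine similarity would score both same-direction pairs identically, which is why it behaves more predictably for byte vectors.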

mayya-sharipova added the >docs label Jun 12, 2023
elasticsearchmachine added the Team:Docs label Jun 12, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@benwtrent
Member

Docs & default behavior have been significantly improved since this issue was opened. Closing.

javanna added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024

5 participants