Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize similarity argument description #1684

Merged
merged 2 commits into from
Nov 2, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions docs/_src/api/api/document_store.md
Original file line number Diff line number Diff line change
Expand Up @@ -200,7 +200,7 @@ A DocumentStore using Elasticsearch to store and query the documents for our sea
If set to 'wait_for', continue only after changes are visible (slow, but safe).
If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion).
More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default sine it is
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `timeout`: Number of seconds after which an ElasticSearch request times out.
- `return_embedding`: To return document embedding
Expand Down Expand Up @@ -1419,9 +1419,7 @@ As a rule of thumb, we would see a 30% ~ 50% increase in the search performance
Note that an overly large index_file_size value may cause failure to load a segment into the memory or graphics memory.
(From https://milvus.io/docs/v1.0.0/performance_faq.md#How-can-I-get-the-best-performance-from-Milvus-through-setting-index_file_size)
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default and recommended for DPR embeddings.
'cosine' is recommended for Sentence Transformers, but is not directly supported by Milvus.
However, Haystack can normalize your embeddings and use `dot_product` to get the same results.
See https://milvus.io/docs/v1.0.0/metric.md?Inner-product-(IP)`floating`.
'cosine' is recommended for Sentence Transformers.
- `index_type`: Type of approximate nearest neighbour (ANN) index used. The choice here determines your tradeoff between speed and accuracy.
Some popular options:
- FLAT (default): Exact method, slow
Expand Down Expand Up @@ -1712,6 +1710,7 @@ The current implementation is not supporting the storage of labels, so you canno
If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned.
- `name_field`: Name of field that contains the title of the the doc
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default.
'cosine' is recommended for Sentence Transformers.
- `index_type`: Index type of any vector object defined in weaviate schema. The vector index type is pluggable.
Currently, HSNW is only supported.
See: https://www.semi.technology/developers/weaviate/current/more-resources/performance.html
Expand Down
2 changes: 1 addition & 1 deletion haystack/document_stores/elasticsearch.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,7 @@ def __init__(
If set to 'wait_for', continue only after changes are visible (slow, but safe).
If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion).
More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html
:param similarity: The similarity function used to compare document vectors. 'dot_product' is the default sine it is
:param similarity: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
:param timeout: Number of seconds after which an ElasticSearch request times out.
:param return_embedding: To return document embedding
Expand Down
7 changes: 3 additions & 4 deletions haystack/document_stores/milvus.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,9 +70,7 @@ def __init__(
Note that an overly large index_file_size value may cause failure to load a segment into the memory or graphics memory.
(From https://milvus.io/docs/v1.0.0/performance_faq.md#How-can-I-get-the-best-performance-from-Milvus-through-setting-index_file_size)
:param similarity: The similarity function used to compare document vectors. 'dot_product' is the default and recommended for DPR embeddings.
'cosine' is recommended for Sentence Transformers, but is not directly supported by Milvus.
However, Haystack can normalize your embeddings and use `dot_product` to get the same results.
See https://milvus.io/docs/v1.0.0/metric.md?Inner-product-(IP)#floating.
'cosine' is recommended for Sentence Transformers.
:param index_type: Type of approximate nearest neighbour (ANN) index used. The choice here determines your tradeoff between speed and accuracy.
Some popular options:
- FLAT (default): Exact method, slow
Expand Down Expand Up @@ -213,7 +211,8 @@ def write_documents(self, documents: Union[List[dict], List[Document]], index: O
for doc in document_batch:
doc_ids.append(doc.id)
if isinstance(doc.embedding, np.ndarray):
if self.similarity=="cosine": self.normalize_embedding(doc.embedding)
if self.similarity=="cosine":
self.normalize_embedding(doc.embedding)
embeddings.append(doc.embedding.tolist())
elif isinstance(doc.embedding, list):
if self.similarity=="cosine":
Expand Down
1 change: 1 addition & 0 deletions haystack/document_stores/weaviate.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ def __init__(
If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned.
:param name_field: Name of field that contains the title of the the doc
:param similarity: The similarity function used to compare document vectors. 'dot_product' is the default.
'cosine' is recommended for Sentence Transformers.
:param index_type: Index type of any vector object defined in weaviate schema. The vector index type is pluggable.
Currently, HSNW is only supported.
See: https://www.semi.technology/developers/weaviate/current/more-resources/performance.html
Expand Down