Add run_batch method to all nodes and Pipeline to allow batch querying (#2481)

* Add run_batch methods for batch querying

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Fix mypy

* Fix linter

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix mypy

* Fix rest api test

* Update Documentation & Code Style

* Add Doc strings

* Update Documentation & Code Style

* Add batch_size as attribute to nodes supporting batching

* Adapt error messages

* Adapt type of filters in retrievers

* Revert change about truncation_warning in summarizer

* Unify multiple_doc_lists tests

* Use smaller models in extractor tests

* Add return types to JoinAnswers and RouteDocuments

* Adapt return statements in reader's run_batch method

* Allow list of filters

* Adapt error messages

* Update Documentation & Code Style

* Fix tests

* Fix mypy

* Adapt print_questions

* Remove disabling warning about too many public methods

* Add flag for pylint to disable warning about too many public methods in pipelines/base.py and document_stores/base.py

* Add type check

* Update Documentation & Code Style

* Adapt tutorial 11

* Update Documentation & Code Style

* Add query_batch method for DCDocStore

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
bogdankostic and github-actions[bot] committed May 11, 2022
1 parent 5378a9a commit 738e008
Showing 68 changed files with 4,843 additions and 362 deletions.
28 changes: 25 additions & 3 deletions docs/_src/api/api/document_classifier.md
@@ -84,7 +84,7 @@ With this document_classifier, you can directly get predictions via predict()
#### TransformersDocumentClassifier.\_\_init\_\_

```python
- def __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = "text-classification", labels: Optional[List[str]] = None, batch_size: int = -1, classification_field: str = None)
+ def __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = "text-classification", labels: Optional[List[str]] = None, batch_size: Optional[int] = None, classification_field: str = None)
```

Load a text classification model from Transformers.
@@ -114,15 +114,15 @@ See https://huggingface.co/models for full list of available models.
["positive", "negative"] otherwise None. Given a LABEL, the sequence fed to the model is "<cls> sequence to
classify <sep> This example is LABEL . <sep>" and the model predicts whether that sequence is a contradiction
or an entailment.
- - `batch_size`: batch size to be processed at once
+ - `batch_size`: Number of Documents to be processed at a time.
- `classification_field`: Name of Document's meta field to be used for classification. If left unset, Document.content is used by default.

<a id="transformers.TransformersDocumentClassifier.predict"></a>

#### TransformersDocumentClassifier.predict

```python
- def predict(documents: List[Document]) -> List[Document]
+ def predict(documents: List[Document], batch_size: Optional[int] = None) -> List[Document]
```

Returns documents containing classification result in meta field.
@@ -132,8 +132,30 @@ Documents are updated in place.
**Arguments**:

- `documents`: List of Document to classify
- `batch_size`: Number of Documents to classify at a time.

**Returns**:

List of Document enriched with meta information

<a id="transformers.TransformersDocumentClassifier.predict_batch"></a>

#### TransformersDocumentClassifier.predict\_batch

```python
def predict_batch(documents: Union[List[Document], List[List[Document]]], batch_size: Optional[int] = None) -> Union[List[Document], List[List[Document]]]
```

Returns documents containing classification result in meta field.

Documents are updated in place.

**Arguments**:

- `documents`: List of Documents or list of lists of Documents to classify.
- `batch_size`: Number of Documents to classify at a time.

**Returns**:

List of Documents or list of lists of Documents enriched with meta information.
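
For illustration, a minimal usage sketch of the new batch API (assuming Haystack v1-style import paths; the meta key printed at the end is an assumption, not confirmed by this diff):

```python
from haystack.nodes import TransformersDocumentClassifier
from haystack.schema import Document

# Minimal sketch: classify two independent lists of Documents in one call.
classifier = TransformersDocumentClassifier(
    model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
    batch_size=16,  # batch_size is now an attribute of the node itself
)

docs_a = [Document(content="I love this product!")]
docs_b = [Document(content="This was a terrible experience.")]

# A list of lists in, a list of lists out; each Document is updated
# in place with its classification result stored in meta.
results = classifier.predict_batch(documents=[docs_a, docs_b])
for doc_list in results:
    for doc in doc_list:
        print(doc.meta["classification"])  # assumed meta key
```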

214 changes: 204 additions & 10 deletions docs/_src/api/api/document_store.md
@@ -289,6 +289,16 @@ is therefore only an interim solution until the run function also accepts documents
If None, the DocumentStore's default index (self.index) will be used.
- `id_hash_keys`: List of the fields that the hashes of the ids are generated from.

<a id="base.BaseDocumentStore.describe_documents"></a>

#### BaseDocumentStore.describe\_documents

```python
def describe_documents(index=None)
```

Return a summary of the documents in the document store
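
A short sketch of how this might be called (hedged: the exact fields of the returned summary, e.g. a document count and content-length statistics, are an assumption):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

# Summarize the default index; pass index="..." to inspect another one.
summary = store.describe_documents()
print(summary)
```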

<a id="base.KeywordDocumentStore"></a>

## KeywordDocumentStore
@@ -390,6 +400,100 @@ Defaults to False.
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="base.KeywordDocumentStore.query_batch"></a>

#### KeywordDocumentStore.query\_batch

```python
@abstractmethod
def query_batch(queries: Union[str, List[str]], filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, all_terms_must_match: bool = False, scale_score: bool = True) -> Union[List[Document], List[List[Document]]]
```

Scan through documents in DocumentStore and return a small number of documents

that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.

This method lets you find relevant documents for a single query string (output: List of Documents), or a
list of query strings (output: List of Lists of Documents).

**Arguments**:

- `queries`: Single query or list of queries.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```

To use the same logical operator multiple times on the same level, logical operators can optionally
take a list of dictionaries as value.

__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `headers`: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
- `all_terms_must_match`: Whether all terms of the query must match the document.
If true, all query terms must be present in a document in order for it to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant").
Otherwise, at least one query term must be present in a document in order for it to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant").
Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
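
To make the single-query vs. list-of-queries contract concrete, a hedged sketch using ElasticsearchDocumentStore as one concrete KeywordDocumentStore (connection details omitted):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

# A single query string yields a flat list of Documents ...
docs = store.query_batch(queries="who invented the telephone", top_k=5)

# ... while a list of queries yields one list of Documents per query.
docs_per_query = store.query_batch(
    queries=["who invented the telephone", "when was it patented"],
    top_k=5,
)
assert len(docs_per_query) == 2
```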

<a id="base.get_batches_from_generator"></a>

#### get\_batches\_from\_generator
@@ -918,6 +1022,106 @@ Defaults to false.
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="elasticsearch.ElasticsearchDocumentStore.query_batch"></a>

#### ElasticsearchDocumentStore.query\_batch

```python
def query_batch(queries: Union[str, List[str]], filters: Optional[
Union[
Dict[str, Union[Dict, List, str, int, float, bool]],
List[Dict[str, Union[Dict, List, str, int, float, bool]]],
]
] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, all_terms_must_match: bool = False, scale_score: bool = True) -> Union[List[Document], List[List[Document]]]
```

Scan through documents in DocumentStore and return a small number of documents

that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.

This method lets you find relevant documents for a single query string (output: List of Documents), or a
list of query strings (output: List of Lists of Documents).

**Arguments**:

- `queries`: Single query or list of queries.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions. Can be a single filter that will be applied to each query or a list of filters
(one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```

To use the same logical operator multiple times on the same level, logical operators can optionally
take a list of dictionaries as value.

__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `headers`: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
- `all_terms_must_match`: Whether all terms of the query must match the document.
If true, all query terms must be present in a document in order for it to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant").
Otherwise, at least one query term must be present in a document in order for it to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant").
Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
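
A sketch of the list-of-filters variant this implementation adds: one filter dictionary per query, matched by position (the metadata field names here are illustrative):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

filters_per_query = [
    {"type": "article", "rating": {"$gte": 3}},  # applied to the first query
    {"type": "blog_post"},                       # applied to the second query
]
results = store.query_batch(
    queries=["economic policy", "machine learning"],
    filters=filters_per_query,
    top_k=10,
)
```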

<a id="elasticsearch.ElasticsearchDocumentStore.query_by_embedding"></a>

#### ElasticsearchDocumentStore.query\_by\_embedding
@@ -1003,16 +1207,6 @@ Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-c
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="elasticsearch.ElasticsearchDocumentStore.describe_documents"></a>

#### ElasticsearchDocumentStore.describe\_documents

```python
def describe_documents(index=None)
```

Return a summary of the documents in the document store

<a id="elasticsearch.ElasticsearchDocumentStore.update_embeddings"></a>

#### ElasticsearchDocumentStore.update\_embeddings
15 changes: 15 additions & 0 deletions docs/_src/api/api/extractor.md
@@ -37,6 +37,21 @@ def extract(text)

This function can be called to perform entity extraction when using the node in isolation.

<a id="entity.EntityExtractor.extract_batch"></a>

#### EntityExtractor.extract\_batch

```python
def extract_batch(texts: Union[List[str], List[List[str]]], batch_size: Optional[int] = None)
```

This function extracts entities from a list of strings or a list of lists of strings.

**Arguments**:

- `texts`: List of str or list of lists of str to extract entities from.
- `batch_size`: Number of texts to make predictions on at a time.
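
A usage sketch (the NER model name is an assumption; any Hugging Face token-classification model should work):

```python
from haystack.nodes import EntityExtractor

extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER")

texts = [
    "Angela Merkel visited Paris in 2019.",
    "Tesla was founded by Martin Eberhard and Marc Tarpenning.",
]
# One list of entity predictions per input string.
entities_per_text = extractor.extract_batch(texts, batch_size=8)
```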

<a id="entity.simplify_ner_for_qa"></a>

#### simplify\_ner\_for\_qa
50 changes: 49 additions & 1 deletion docs/_src/api/api/generator.md
@@ -33,6 +33,54 @@ Abstract method to generate answers.

Generated answers plus additional info in a dict

<a id="base.BaseGenerator.predict_batch"></a>

#### BaseGenerator.predict\_batch

```python
def predict_batch(queries: Union[str, List[str]], documents: Union[List[Document], List[List[Document]]], top_k: Optional[int] = None, batch_size: Optional[int] = None)
```

Generate the answer to the input queries. The generation will be conditioned on the supplied documents.

These documents can for example be retrieved via the Retriever.

- If you provide a single query...

- ... and a single list of Documents, the query will be applied to each Document individually.
- ... and a list of lists of Documents, the query will be applied to each list of Documents and the Answers
will be aggregated per Document list.

- If you provide a list of queries...

- ... and a single list of Documents, each query will be applied to each Document individually.
- ... and a list of lists of Documents, each query will be applied to its corresponding list of Documents
and the Answers will be aggregated per query-Document pair.

**Arguments**:

- `queries`: Single query or list of queries.
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
Can be a single list of Documents or a list of lists of Documents.
- `top_k`: Number of returned answers per query.
- `batch_size`: Not applicable.

**Returns**:

Generated answers plus additional info in a dict like this:
```python
| {'queries': 'who got the first nobel prize in physics',
| 'answers':
| [{'query': 'who got the first nobel prize in physics',
| 'answer': ' albert einstein',
| 'meta': { 'doc_ids': [...],
| 'doc_scores': [80.42758 ...],
| 'doc_probabilities': [40.71379089355469, ...
| 'content': ['Albert Einstein was a ...]
| 'titles': ['"Albert Einstein"', ...]
| }}]}
```
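
As a usage sketch, with Seq2SeqGenerator and the `vblagoje/bart_lfqa` model standing in for any concrete BaseGenerator subclass (an assumption; in practice the Document lists would come from a retriever's `retrieve_batch`):

```python
from haystack.nodes import Seq2SeqGenerator
from haystack.schema import Document

generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

queries = [
    "who got the first nobel prize in physics",
    "who invented the telephone",
]
# One list of Documents per query; answers are aggregated per
# query-Document-list pair.
docs_per_query = [
    [Document(content="Wilhelm Conrad Roentgen received the first Nobel Prize in Physics in 1901.")],
    [Document(content="Alexander Graham Bell patented the telephone in 1876.")],
]

result = generator.predict_batch(queries=queries, documents=docs_per_query, top_k=1)
for answer in result["answers"]:
    print(answer)
```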

<a id="transformers"></a>

# Module transformers
@@ -123,7 +171,7 @@ def predict(query: str, documents: List[Document], top_k: Optional[int] = None)

Generate the answer to the input query. The generation will be conditioned on the supplied documents.

- These document can for example be retrieved via the Retriever.
+ These documents can for example be retrieved via the Retriever.

**Arguments**:
