Add run_batch method to all nodes and Pipeline to allow batch querying (#2481)

* Add run_batch methods for batch querying

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Fix mypy

* Fix linter

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix mypy

* Fix rest api test

* Update Documentation & Code Style

* Add Doc strings

* Update Documentation & Code Style

* Add batch_size as attribute to nodes supporting batching

* Adapt error messages

* Adapt type of filters in retrievers

* Revert change about truncation_warning in summarizer

* Unify multiple_doc_lists tests

* Use smaller models in extractor tests

* Add return types to JoinAnswers and RouteDocuments

* Adapt return statements in reader's run_batch method

* Allow list of filters

* Adapt error messages

* Update Documentation & Code Style

* Fix tests

* Fix mypy

* Adapt print_questions

* Remove disabling warning about too many public methods

* Add flag for pylint to disable warning about too many public methods in pipelines/base.py and document_stores/base.py

* Add type check

* Update Documentation & Code Style

* Adapt tutorial 11

* Update Documentation & Code Style

* Add query_batch method for DCDocStore

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
bogdankostic and github-actions[bot] committed May 11, 2022
1 parent 5378a9a commit 738e008
Showing 68 changed files with 4,843 additions and 362 deletions.
28 changes: 25 additions & 3 deletions docs/_src/api/api/document_classifier.md
@@ -84,7 +84,7 @@ With this document_classifier, you can directly get predictions via predict()
#### TransformersDocumentClassifier.\_\_init\_\_

```python
- def __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = "text-classification", labels: Optional[List[str]] = None, batch_size: int = -1, classification_field: str = None)
+ def __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = "text-classification", labels: Optional[List[str]] = None, batch_size: Optional[int] = None, classification_field: str = None)
```

Load a text classification model from Transformers.
@@ -114,15 +114,15 @@ See https://huggingface.co/models for full list of available models.
["positive", "negative"] otherwise None. Given a LABEL, the sequence fed to the model is "<cls> sequence to
classify <sep> This example is LABEL . <sep>" and the model predicts whether that sequence is a contradiction
or an entailment.
- - `batch_size`: batch size to be processed at once
+ - `batch_size`: Number of Documents to be processed at a time.
- `classification_field`: Name of Document's meta field to be used for classification. If left unset, Document.content is used by default.

<a id="transformers.TransformersDocumentClassifier.predict"></a>

#### TransformersDocumentClassifier.predict

```python
- def predict(documents: List[Document]) -> List[Document]
+ def predict(documents: List[Document], batch_size: Optional[int] = None) -> List[Document]
```

Returns documents containing classification result in meta field.
@@ -132,8 +132,30 @@ Documents are updated in place.
**Arguments**:

- `documents`: List of Document to classify
- `batch_size`: Number of Documents to classify at a time.

**Returns**:

List of Document enriched with meta information

<a id="transformers.TransformersDocumentClassifier.predict_batch"></a>

#### TransformersDocumentClassifier.predict\_batch

```python
def predict_batch(documents: Union[List[Document], List[List[Document]]], batch_size: Optional[int] = None) -> Union[List[Document], List[List[Document]]]
```

Returns documents containing classification result in meta field.

Documents are updated in place.

**Arguments**:

- `documents`: List of Documents or list of lists of Documents to classify.
- `batch_size`: Number of Documents to classify at a time.

**Returns**:

List of Documents or list of lists of Documents enriched with meta information.
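
For illustration, a minimal usage sketch of the new batch API (assuming Haystack v1-style import paths; the meta key printed at the end is an assumption, not confirmed by this diff):

```python
from haystack.nodes import TransformersDocumentClassifier
from haystack.schema import Document

# Minimal sketch: classify two independent lists of Documents in one call.
classifier = TransformersDocumentClassifier(
    model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
    batch_size=16,  # batch_size is now an attribute of the node itself
)

docs_a = [Document(content="I love this product!")]
docs_b = [Document(content="This was a terrible experience.")]

# A list of lists in, a list of lists out; each Document is updated
# in place with its classification result stored in meta.
results = classifier.predict_batch(documents=[docs_a, docs_b])
for doc_list in results:
    for doc in doc_list:
        print(doc.meta["classification"])  # assumed meta key
```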

214 changes: 204 additions & 10 deletions docs/_src/api/api/document_store.md
@@ -289,6 +289,16 @@ is therefore only an interim solution until the run function also accepts documents
If None, the DocumentStore's default index (self.index) will be used.
- `id_hash_keys`: List of the fields that the hashes of the ids are generated from.

<a id="base.BaseDocumentStore.describe_documents"></a>

#### BaseDocumentStore.describe\_documents

```python
def describe_documents(index=None)
```

Return a summary of the documents in the document store
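
A short sketch of how this might be called (hedged: the exact fields of the returned summary, e.g. a document count and content-length statistics, are an assumption):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

# Summarize the default index; pass index="..." to inspect another one.
summary = store.describe_documents()
print(summary)
```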

<a id="base.KeywordDocumentStore"></a>

## KeywordDocumentStore
@@ -390,6 +400,100 @@ Defaults to False.
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="base.KeywordDocumentStore.query_batch"></a>

#### KeywordDocumentStore.query\_batch

```python
@abstractmethod
def query_batch(queries: Union[str, List[str]], filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, all_terms_must_match: bool = False, scale_score: bool = True) -> Union[List[Document], List[List[Document]]]
```

Scan through documents in DocumentStore and return a small number of documents

that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.

This method lets you find relevant documents for a single query string (output: List of Documents), or a
list of query strings (output: List of Lists of Documents).

**Arguments**:

- `queries`: Single query or list of queries.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```

To use the same logical operator multiple times on the same level, logical operators can optionally
take a list of dictionaries as value.

__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `headers`: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
- `all_terms_must_match`: Whether all terms of the query must match the document.
If true, all query terms must be present in a document in order for it to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant").
Otherwise, at least one query term must be present in a document in order for it to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant").
Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
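
To make the single-query vs. list-of-queries contract concrete, a hedged sketch using ElasticsearchDocumentStore as one concrete KeywordDocumentStore (connection details omitted):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

# A single query string yields a flat list of Documents ...
docs = store.query_batch(queries="who invented the telephone", top_k=5)

# ... while a list of queries yields one list of Documents per query.
docs_per_query = store.query_batch(
    queries=["who invented the telephone", "when was it patented"],
    top_k=5,
)
assert len(docs_per_query) == 2
```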

<a id="base.get_batches_from_generator"></a>

#### get\_batches\_from\_generator
@@ -918,6 +1022,106 @@ Defaults to false.
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="elasticsearch.ElasticsearchDocumentStore.query_batch"></a>

#### ElasticsearchDocumentStore.query\_batch

```python
def query_batch(queries: Union[str, List[str]], filters: Optional[
Union[
Dict[str, Union[Dict, List, str, int, float, bool]],
List[Dict[str, Union[Dict, List, str, int, float, bool]]],
]
] = None, top_k: int = 10, custom_query: Optional[str] = None, index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, all_terms_must_match: bool = False, scale_score: bool = True) -> Union[List[Document], List[List[Document]]]
```

Scan through documents in DocumentStore and return a small number of documents

that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.

This method lets you find relevant documents for a single query string (output: List of Documents), or a
list of query strings (output: List of Lists of Documents).

**Arguments**:

- `queries`: Single query or list of queries.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions. Can be a single filter that will be applied to each query or a list of filters
(one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```

To use the same logical operator multiple times on the same level, logical operators can optionally
take a list of dictionaries as value.

__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `headers`: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
- `all_terms_must_match`: Whether all terms of the query must match the document.
If true, all query terms must be present in a document in order for it to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant").
Otherwise, at least one query term must be present in a document in order for it to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant").
Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
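
A sketch of the list-of-filters variant this implementation adds: one filter dictionary per query, matched by position (the metadata field names here are illustrative):

```python
from haystack.document_stores import ElasticsearchDocumentStore

store = ElasticsearchDocumentStore()

filters_per_query = [
    {"type": "article", "rating": {"$gte": 3}},  # applied to the first query
    {"type": "blog_post"},                       # applied to the second query
]
results = store.query_batch(
    queries=["economic policy", "machine learning"],
    filters=filters_per_query,
    top_k=10,
)
```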

<a id="elasticsearch.ElasticsearchDocumentStore.query_by_embedding"></a>

#### ElasticsearchDocumentStore.query\_by\_embedding
@@ -1003,16 +1207,6 @@ Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-c
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

<a id="elasticsearch.ElasticsearchDocumentStore.describe_documents"></a>

#### ElasticsearchDocumentStore.describe\_documents

```python
def describe_documents(index=None)
```

Return a summary of the documents in the document store

<a id="elasticsearch.ElasticsearchDocumentStore.update_embeddings"></a>

#### ElasticsearchDocumentStore.update\_embeddings
15 changes: 15 additions & 0 deletions docs/_src/api/api/extractor.md
@@ -37,6 +37,21 @@ def extract(text)

This function can be called to perform entity extraction when using the node in isolation.

<a id="entity.EntityExtractor.extract_batch"></a>

#### EntityExtractor.extract\_batch

```python
def extract_batch(texts: Union[List[str], List[List[str]]], batch_size: Optional[int] = None)
```

This function extracts entities from a list of strings or a list of lists of strings.

**Arguments**:

- `texts`: List of str or list of lists of str to extract entities from.
- `batch_size`: Number of texts to make predictions on at a time.
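
A usage sketch (the NER model name is an assumption; any Hugging Face token-classification model should work):

```python
from haystack.nodes import EntityExtractor

extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER")

texts = [
    "Angela Merkel visited Paris in 2019.",
    "Tesla was founded by Martin Eberhard and Marc Tarpenning.",
]
# One list of entity predictions per input string.
entities_per_text = extractor.extract_batch(texts, batch_size=8)
```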

<a id="entity.simplify_ner_for_qa"></a>

#### simplify\_ner\_for\_qa
50 changes: 49 additions & 1 deletion docs/_src/api/api/generator.md
@@ -33,6 +33,54 @@ Abstract method to generate answers.

Generated answers plus additional info in a dict

<a id="base.BaseGenerator.predict_batch"></a>

#### BaseGenerator.predict\_batch

```python
def predict_batch(queries: Union[str, List[str]], documents: Union[List[Document], List[List[Document]]], top_k: Optional[int] = None, batch_size: Optional[int] = None)
```

Generate the answer to the input queries. The generation will be conditioned on the supplied documents.

These documents can for example be retrieved via the Retriever.

- If you provide a single query...

- ... and a single list of Documents, the query will be applied to each Document individually.
- ... and a list of lists of Documents, the query will be applied to each list of Documents and the Answers
will be aggregated per Document list.

- If you provide a list of queries...

- ... and a single list of Documents, each query will be applied to each Document individually.
- ... and a list of lists of Documents, each query will be applied to its corresponding list of Documents
and the Answers will be aggregated per query-Document pair.

**Arguments**:

- `queries`: Single query or list of queries.
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
Can be a single list of Documents or a list of lists of Documents.
- `top_k`: Number of returned answers per query.
- `batch_size`: Not applicable.

**Returns**:

Generated answers plus additional info in a dict like this:
```python
| {'queries': 'who got the first nobel prize in physics',
| 'answers':
| [{'query': 'who got the first nobel prize in physics',
| 'answer': ' albert einstein',
| 'meta': { 'doc_ids': [...],
| 'doc_scores': [80.42758 ...],
| 'doc_probabilities': [40.71379089355469, ...
| 'content': ['Albert Einstein was a ...]
| 'titles': ['"Albert Einstein"', ...]
| }}]}
```
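
As a usage sketch, with Seq2SeqGenerator and the `vblagoje/bart_lfqa` model standing in for any concrete BaseGenerator subclass (an assumption; in practice the Document lists would come from a retriever's `retrieve_batch`):

```python
from haystack.nodes import Seq2SeqGenerator
from haystack.schema import Document

generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

queries = [
    "who got the first nobel prize in physics",
    "who invented the telephone",
]
# One list of Documents per query; answers are aggregated per
# query-Document-list pair.
docs_per_query = [
    [Document(content="Wilhelm Conrad Roentgen received the first Nobel Prize in Physics in 1901.")],
    [Document(content="Alexander Graham Bell patented the telephone in 1876.")],
]

result = generator.predict_batch(queries=queries, documents=docs_per_query, top_k=1)
for answer in result["answers"]:
    print(answer)
```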

<a id="transformers"></a>

# Module transformers
@@ -123,7 +171,7 @@ def predict(query: str, documents: List[Document], top_k: Optional[int] = None)

Generate the answer to the input query. The generation will be conditioned on the supplied documents.

- These document can for example be retrieved via the Retriever.
+ These documents can for example be retrieved via the Retriever.

**Arguments**:
