Add isolated node eval mode in pipeline eval (deepset-ai#1962)
* run predictions on ground-truth docs in reader

* build dataframe for closed/open domain eval

* fix looping through multilabel

* fix looping through multilabel's list of labels

* simplify collecting relevant docs

* switch closed-domain eval off by default

* Add latest docstring and tutorial changes

* handle edge case params not given

* renaming & generate pipeline eval report

* add test case for closed-domain eval metrics

* Add latest docstring and tutorial changes

* test report of closed-domain eval

* report closed-domain metrics only for answer metrics not doc metrics

* refactoring

* fix mypy & remove comment

* add second for-loop & use answer as method input

* renaming & add separate loop building docs eval df

* Add latest docstring and tutorial changes

* change column order for evaluation dataframe (deepset-ai#1957)

* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <[email protected]>

* fix column reordering after renaming of node_input

* simplify tests & add docs

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ju-gu <[email protected]>
Co-authored-by: tstadel <[email protected]>
Co-authored-by: Thomas Stadelmann <[email protected]>
5 people committed Jan 14, 2022
1 parent e28bf61 commit a3147ca
Showing 7 changed files with 220 additions and 133 deletions.
13 changes: 11 additions & 2 deletions docs/_src/api/api/pipelines.md
@@ -162,7 +162,7 @@ Runs the pipeline, one node at a time.
#### eval

```python
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: str = None) -> EvaluationResult
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: str = None, add_isolated_node_eval: bool = False) -> EvaluationResult
```

Evaluates the pipeline by running the pipeline once per query in debug mode
@@ -186,6 +186,14 @@ and putting together all data that is needed for evaluation, e.g. calculating metrics
- Good default for multiple languages: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
- Large, powerful, but slow model for English only: "cross-encoder/stsb-roberta-large"
- Large model for German only: "deepset/gbert-large-sts"
- `add_isolated_node_eval`: If set to True, in addition to the integrated evaluation of the pipeline, each node is evaluated in isolated evaluation mode.
This mode helps to understand the bottlenecks of a pipeline in terms of output quality of each individual node.
If a node performs much better in the isolated evaluation than in the integrated evaluation, the previous node needs to be optimized to improve the pipeline's performance.
If a node's performance is similar in both modes, this node itself needs to be optimized to improve the pipeline's performance.
The isolated evaluation calculates the upper bound of each node's evaluation metrics under the assumption that it received perfect inputs from the previous node.
To this end, labels are used as input to the node instead of the output of the previous node in the pipeline.
The generated dataframes in the EvaluationResult then contain additional rows, which can be distinguished from the integrated evaluation results by the
values "integrated" or "isolated" in the column "eval_mode". The evaluation report then additionally lists the upper bound of each node's evaluation metrics.

<a name="base.Pipeline.get_nodes_by_class"></a>
#### get\_nodes\_by\_class
@@ -627,7 +635,7 @@ Instance of DocumentStore or None
#### eval

```python
| eval(labels: List[MultiLabel], params: Optional[dict], sas_model_name_or_path: str = None) -> EvaluationResult
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: Optional[str] = None, add_isolated_node_eval: bool = False) -> EvaluationResult
```

Evaluates the pipeline by running the pipeline once per query in debug mode
@@ -640,6 +648,7 @@ and putting together all data that is needed for evaluation, e.g. calculating metrics
params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
- `sas_model_name_or_path`: SentenceTransformers semantic textual similarity model to be used for sas value calculation,
should be path or string pointing to downloadable models.
- `add_isolated_node_eval`: Whether to additionally evaluate the reader based on the labels as input instead of the output of the previous node in the pipeline.
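
A hedged usage sketch with an `ExtractiveQAPipeline` (the names `reader`, `retriever` and `eval_labels` are assumed to exist, and `print_eval_report` is assumed to be the report helper available on the wrapped `Pipeline` in this version):

```python
from haystack.pipelines import ExtractiveQAPipeline

qa_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = qa_pipeline.eval(
    labels=eval_labels,                   # List[MultiLabel]
    params={"Retriever": {"top_k": 10}},
    add_isolated_node_eval=True,          # reader is additionally fed the gold documents
)

# The evaluation report then also lists the isolated (upper-bound) metrics per node.
qa_pipeline.pipeline.print_eval_report(eval_result)
```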

<a name="standard_pipelines.ExtractiveQAPipeline"></a>
## ExtractiveQAPipeline
24 changes: 12 additions & 12 deletions docs/_src/api/api/primitives.md
@@ -294,7 +294,7 @@ The DataFrames have the following schema:
#### calculate\_metrics

```python
| calculate_metrics(simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", node_input: str = "prediction") -> Dict[str, Dict[str, float]]
| calculate_metrics(simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", eval_mode: str = "integrated") -> Dict[str, Dict[str, float]]
```

Calculates proper metrics for each node.
@@ -324,19 +324,19 @@ as there are situations the result can heavily differ from an actual eval run with
remarks: there might be a discrepancy between simulated reader metrics and an actual pipeline run with retriever top_k
- `doc_relevance_col`: column in the underlying eval table that contains the relevance criteria for documents.
values can be: 'gold_id_match', 'answer_match', 'gold_id_or_answer_match'
- `node_input`: the input on which the node was evaluated on.
Usually nodes get evaluated on the prediction provided by its predecessor nodes in the pipeline (value='prediction').
- `eval_mode`: the input the node was evaluated on.
Usually nodes get evaluated on the predictions provided by their predecessor nodes in the pipeline (value='integrated').
However, as the quality of the node itself can heavily depend on the node's input and thus the predecessor's quality,
you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your node.
For example when evaluating the reader use value='label' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'prediction', 'label'.
Default value is 'prediction'.
For example, when evaluating the reader, use value='isolated' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'integrated', 'isolated'.
Default value is 'integrated'.
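
A short sketch of comparing both modes (assuming `eval_result` was produced with `add_isolated_node_eval=True` and the reader node is named "Reader"):

```python
integrated = eval_result.calculate_metrics(eval_mode="integrated")
isolated = eval_result.calculate_metrics(eval_mode="isolated")

# If the isolated (upper-bound) reader F1 is much higher than the integrated one,
# the retriever is the more promising node to optimize; if both are similar,
# the reader itself is the bottleneck.
print(integrated["Reader"]["f1"], isolated["Reader"]["f1"])
```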

<a name="schema.EvaluationResult.wrong_examples"></a>
#### wrong\_examples

```python
| wrong_examples(node: str, n: int = 3, simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", document_metric: str = "recall_single_hit", answer_metric: str = "f1", node_input: str = "prediction") -> List[Dict]
| wrong_examples(node: str, n: int = 3, simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", document_metric: str = "recall_single_hit", answer_metric: str = "f1", eval_mode: str = "integrated") -> List[Dict]
```

Returns the worst performing queries.
@@ -357,13 +357,13 @@ See calculate_metrics() for more information.
values can be: 'recall_single_hit', 'recall_multi_hit', 'mrr', 'map', 'precision'
- `document_metric`: the answer metric worst queries are calculated with.
values can be: 'f1', 'exact_match' and 'sas' if the evaluation was made using a SAS model.
- `node_input`: the input on which the node was evaluated on.
Usually nodes get evaluated on the prediction provided by its predecessor nodes in the pipeline (value='prediction').
- `eval_mode`: the input the node was evaluated on.
Usually nodes get evaluated on the predictions provided by their predecessor nodes in the pipeline (value='integrated').
However, as the quality of the node itself can heavily depend on the node's input and thus the predecessor's quality,
you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your node.
For example when evaluating the reader use value='label' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'prediction', 'label'.
Default value is 'prediction'.
For example, when evaluating the reader, use value='isolated' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'integrated', 'isolated'.
Default value is 'integrated'.
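
A short sketch of inspecting the worst queries in isolated mode (same assumptions as above: `eval_result` was produced with `add_isolated_node_eval=True` and the reader node is named "Reader"):

```python
worst_queries = eval_result.wrong_examples(
    node="Reader",
    n=3,
    answer_metric="f1",
    eval_mode="isolated",
)
for example in worst_queries:
    print(example)
```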

<a name="schema.EvaluationResult.save"></a>
#### save
39 changes: 26 additions & 13 deletions haystack/nodes/reader/base.py
@@ -7,7 +7,7 @@
from functools import wraps
from time import perf_counter

from haystack.schema import Document, Answer, Span
from haystack.schema import Document, Answer, Span, MultiLabel
from haystack.nodes.base import BaseComponent


@@ -55,7 +55,22 @@ def _calc_no_answer(no_ans_gaps: Sequence[float],

return no_ans_prediction, max_no_ans_gap

def run(self, query: str, documents: List[Document], top_k: Optional[int] = None): # type: ignore
@staticmethod
def add_doc_meta_data_to_answer(documents: List[Document], answer):
# Add corresponding document_name and more meta data, if the answer contains the document_id
if answer.meta is None:
answer.meta = {}
# get meta from doc
meta_from_doc = {}
for doc in documents:
if doc.id == answer.document_id:
meta_from_doc = deepcopy(doc.meta)
break
# append to "own" meta
answer.meta.update(meta_from_doc)
return answer

def run(self, query: str, documents: List[Document], top_k: Optional[int] = None, labels: Optional[MultiLabel] = None, add_isolated_node_eval: bool = False): # type: ignore
self.query_count += 1
if documents:
predict = self.timing(self.predict, "query_time")
@@ -64,17 +79,15 @@ def run(self, query: str, documents: List[Document], top_k: Optional[int] = None
results = {"answers": []}

# Add corresponding document_name and more meta data, if an answer contains the document_id
for ans in results["answers"]:
if ans.meta is None:
ans.meta = {}
# get meta from doc
meta_from_doc = {}
for doc in documents:
if doc.id == ans.document_id:
meta_from_doc = deepcopy(doc.meta)
break
# append to "own" meta
ans.meta.update(meta_from_doc)
results["answers"] = [BaseReader.add_doc_meta_data_to_answer(documents=documents, answer=answer) for answer in results["answers"]]

# run evaluation with labels as node inputs
if add_isolated_node_eval and labels is not None:
relevant_documents = [label.document for label in labels.labels]
results_label_input = predict(query=query, documents=relevant_documents, top_k=top_k)

# Add corresponding document_name and more meta data, if an answer contains the document_id
results["answers_isolated"] = [BaseReader.add_doc_meta_data_to_answer(documents=documents, answer=answer) for answer in results_label_input["answers"]]

return results, "output_1"
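
For context, a sketch of what the extended `run()` returns when labels are supplied; `reader`, `retrieved_docs` and `gold_multilabel` are placeholder names, and in practice `Pipeline.eval()` passes `labels` and `add_isolated_node_eval` for you:

```python
results, _ = reader.run(
    query="Who created Python?",
    documents=retrieved_docs,        # output of the previous node (e.g. the retriever)
    labels=gold_multilabel,          # MultiLabel whose labels carry the gold documents
    add_isolated_node_eval=True,
)
print(results["answers"])            # predictions on retrieved documents ("integrated")
print(results["answers_isolated"])   # predictions on gold documents only ("isolated")
```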
