Add isolated node eval mode in pipeline eval (deepset-ai#1962)
* run predictions on ground-truth docs in reader

* build dataframe for closed/open domain eval

* fix looping through multilabel

* fix looping through multilabel's list of labels

* simplify collecting relevant docs

* switch closed-domain eval off by default

* Add latest docstring and tutorial changes

* handle edge case params not given

* renaming & generate pipeline eval report

* add test case for closed-domain eval metrics

* Add latest docstring and tutorial changes

* test report of closed-domain eval

* report closed-domain metrics only for answer metrics not doc metrics

* refactoring

* fix mypy & remove comment

* add second for-loop & use answer as method input

* renaming & add separate loop building docs eval df

* Add latest docstring and tutorial changes

* change column order for evaluation dataframe (deepset-ai#1957)

* change column order for evaluation dataframe

* added missing eval column node_input

* generic order for both document and answer returning nodes; ensure no columns get lost

Co-authored-by: tstadel <[email protected]>

* fix column reordering after renaming of node_input

* simplify tests & add docs

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ju-gu <[email protected]>
Co-authored-by: tstadel <[email protected]>
Co-authored-by: Thomas Stadelmann <[email protected]>
5 people committed Jan 14, 2022
1 parent e28bf61 commit a3147ca
Showing 7 changed files with 220 additions and 133 deletions.
13 changes: 11 additions & 2 deletions docs/_src/api/api/pipelines.md
@@ -162,7 +162,7 @@ Runs the pipeline, one node at a time.
#### eval

```python
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: str = None) -> EvaluationResult
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: str = None, add_isolated_node_eval: bool = False) -> EvaluationResult
```

Evaluates the pipeline by running the pipeline once per query in debug mode
@@ -186,6 +186,14 @@ and putting together all data that is needed for evaluation, e.g. calculating metrics
- Good default for multiple languages: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
- Large, powerful, but slow model for English only: "cross-encoder/stsb-roberta-large"
- Large model for German only: "deepset/gbert-large-sts"
- `add_isolated_node_eval`: If set to True, in addition to the integrated evaluation of the pipeline, each node is evaluated in isolated evaluation mode.
This mode helps to understand the bottlenecks of a pipeline in terms of output quality of each individual node.
If a node performs much better in the isolated evaluation than in the integrated evaluation, the previous node needs to be optimized to improve the pipeline's performance.
If a node's performance is similar in both modes, this node itself needs to be optimized to improve the pipeline's performance.
The isolated evaluation calculates the upper bound of each node's evaluation metrics under the assumption that it received perfect inputs from the previous node.
To this end, labels are used as input to the node instead of the output of the previous node in the pipeline.
The generated dataframes in the EvaluationResult then contain additional rows, which can be distinguished from the integrated evaluation results by the
values "integrated" or "isolated" in the column "eval_mode". The evaluation report then additionally lists the upper bound of each node's evaluation metrics.

<a name="base.Pipeline.get_nodes_by_class"></a>
#### get\_nodes\_by\_class
@@ -627,7 +635,7 @@ Instance of DocumentStore or None
#### eval

```python
| eval(labels: List[MultiLabel], params: Optional[dict], sas_model_name_or_path: str = None) -> EvaluationResult
| eval(labels: List[MultiLabel], params: Optional[dict] = None, sas_model_name_or_path: Optional[str] = None, add_isolated_node_eval: bool = False) -> EvaluationResult
```

Evaluates the pipeline by running the pipeline once per query in debug mode
@@ -640,6 +648,7 @@ and putting together all data that is needed for evaluation, e.g. calculating metrics
params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
- `sas_model_name_or_path`: SentenceTransformers semantic textual similarity model to be used for sas value calculation,
should be path or string pointing to downloadable models.
- `add_isolated_node_eval`: Whether to additionally evaluate the reader based on the labels as input instead of the output of the previous node in the pipeline.
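
A hedged usage sketch with an `ExtractiveQAPipeline` (the names `reader`, `retriever` and `eval_labels` are assumed to exist, and `print_eval_report` is assumed to be the report helper available on the wrapped `Pipeline` in this version):

```python
from haystack.pipelines import ExtractiveQAPipeline

qa_pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = qa_pipeline.eval(
    labels=eval_labels,                   # List[MultiLabel]
    params={"Retriever": {"top_k": 10}},
    add_isolated_node_eval=True,          # reader is additionally fed the gold documents
)

# The evaluation report then also lists the isolated (upper-bound) metrics per node.
qa_pipeline.pipeline.print_eval_report(eval_result)
```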

<a name="standard_pipelines.ExtractiveQAPipeline"></a>
## ExtractiveQAPipeline
24 changes: 12 additions & 12 deletions docs/_src/api/api/primitives.md
@@ -294,7 +294,7 @@ The DataFrames have the following schema:
#### calculate\_metrics

```python
| calculate_metrics(simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", node_input: str = "prediction") -> Dict[str, Dict[str, float]]
| calculate_metrics(simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", eval_mode: str = "integrated") -> Dict[str, Dict[str, float]]
```

Calculates proper metrics for each node.
@@ -324,19 +324,19 @@ as there are situations the result can heavily differ from an actual eval run with
remarks: there might be a discrepancy between simulated reader metrics and an actual pipeline run with retriever top_k
- `doc_relevance_col`: column in the underlying eval table that contains the relevance criteria for documents.
values can be: 'gold_id_match', 'answer_match', 'gold_id_or_answer_match'
- `node_input`: the input on which the node was evaluated on.
Usually nodes get evaluated on the prediction provided by its predecessor nodes in the pipeline (value='prediction').
- `eval_mode`: the input the node was evaluated on.
Usually nodes get evaluated on the predictions provided by their predecessor nodes in the pipeline (value='integrated').
However, as the quality of the node itself can heavily depend on the node's input and thus the predecessor's quality,
you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your node.
For example when evaluating the reader use value='label' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'prediction', 'label'.
Default value is 'prediction'.
For example, when evaluating the reader, use value='isolated' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'integrated', 'isolated'.
Default value is 'integrated'.
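
A short sketch of comparing both modes (assuming `eval_result` was produced with `add_isolated_node_eval=True` and the reader node is named "Reader"):

```python
integrated = eval_result.calculate_metrics(eval_mode="integrated")
isolated = eval_result.calculate_metrics(eval_mode="isolated")

# If the isolated (upper-bound) reader F1 is much higher than the integrated one,
# the retriever is the more promising node to optimize; if both are similar,
# the reader itself is the bottleneck.
print(integrated["Reader"]["f1"], isolated["Reader"]["f1"])
```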

<a name="schema.EvaluationResult.wrong_examples"></a>
#### wrong\_examples

```python
| wrong_examples(node: str, n: int = 3, simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", document_metric: str = "recall_single_hit", answer_metric: str = "f1", node_input: str = "prediction") -> List[Dict]
| wrong_examples(node: str, n: int = 3, simulated_top_k_reader: int = -1, simulated_top_k_retriever: int = -1, doc_relevance_col: str = "gold_id_match", document_metric: str = "recall_single_hit", answer_metric: str = "f1", eval_mode: str = "integrated") -> List[Dict]
```

Returns the worst performing queries.
@@ -357,13 +357,13 @@ See calculate_metrics() for more information.
values can be: 'recall_single_hit', 'recall_multi_hit', 'mrr', 'map', 'precision'
- `document_metric`: the answer metric worst queries are calculated with.
values can be: 'f1', 'exact_match' and 'sas' if the evaluation was made using a SAS model.
- `node_input`: the input on which the node was evaluated on.
Usually nodes get evaluated on the prediction provided by its predecessor nodes in the pipeline (value='prediction').
- `eval_mode`: the input the node was evaluated on.
Usually nodes get evaluated on the predictions provided by their predecessor nodes in the pipeline (value='integrated').
However, as the quality of the node itself can heavily depend on the node's input and thus the predecessor's quality,
you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your node.
For example when evaluating the reader use value='label' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'prediction', 'label'.
Default value is 'prediction'.
For example, when evaluating the reader, use value='isolated' to simulate a perfect retriever in an ExtractiveQAPipeline.
Values can be 'integrated', 'isolated'.
Default value is 'integrated'.
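
A short sketch of inspecting the worst queries in isolated mode (same assumptions as above: `eval_result` was produced with `add_isolated_node_eval=True` and the reader node is named "Reader"):

```python
worst_queries = eval_result.wrong_examples(
    node="Reader",
    n=3,
    answer_metric="f1",
    eval_mode="isolated",
)
for example in worst_queries:
    print(example)
```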

<a name="schema.EvaluationResult.save"></a>
#### save
39 changes: 26 additions & 13 deletions haystack/nodes/reader/base.py
@@ -7,7 +7,7 @@
from functools import wraps
from time import perf_counter

from haystack.schema import Document, Answer, Span
from haystack.schema import Document, Answer, Span, MultiLabel
from haystack.nodes.base import BaseComponent


@@ -55,7 +55,22 @@ def _calc_no_answer(no_ans_gaps: Sequence[float],

return no_ans_prediction, max_no_ans_gap

def run(self, query: str, documents: List[Document], top_k: Optional[int] = None): # type: ignore
@staticmethod
def add_doc_meta_data_to_answer(documents: List[Document], answer):
# Add corresponding document_name and more meta data, if the answer contains the document_id
if answer.meta is None:
answer.meta = {}
# get meta from doc
meta_from_doc = {}
for doc in documents:
if doc.id == answer.document_id:
meta_from_doc = deepcopy(doc.meta)
break
# append to "own" meta
answer.meta.update(meta_from_doc)
return answer

def run(self, query: str, documents: List[Document], top_k: Optional[int] = None, labels: Optional[MultiLabel] = None, add_isolated_node_eval: bool = False): # type: ignore
self.query_count += 1
if documents:
predict = self.timing(self.predict, "query_time")
@@ -64,17 +79,15 @@ def run(self, query: str, documents: List[Document], top_k: Optional[int] = None
results = {"answers": []}

# Add corresponding document_name and more meta data, if an answer contains the document_id
for ans in results["answers"]:
if ans.meta is None:
ans.meta = {}
# get meta from doc
meta_from_doc = {}
for doc in documents:
if doc.id == ans.document_id:
meta_from_doc = deepcopy(doc.meta)
break
# append to "own" meta
ans.meta.update(meta_from_doc)
results["answers"] = [BaseReader.add_doc_meta_data_to_answer(documents=documents, answer=answer) for answer in results["answers"]]

# run evaluation with labels as node inputs
if add_isolated_node_eval and labels is not None:
relevant_documents = [label.document for label in labels.labels]
results_label_input = predict(query=query, documents=relevant_documents, top_k=top_k)

# Add corresponding document_name and more meta data, if an answer contains the document_id
results["answers_isolated"] = [BaseReader.add_doc_meta_data_to_answer(documents=documents, answer=answer) for answer in results_label_input["answers"]]

return results, "output_1"
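
For context, a sketch of what the extended `run()` returns when labels are supplied; `reader`, `retrieved_docs` and `gold_multilabel` are placeholder names, and in practice `Pipeline.eval()` passes `labels` and `add_isolated_node_eval` for you:

```python
results, _ = reader.run(
    query="Who created Python?",
    documents=retrieved_docs,        # output of the previous node (e.g. the retriever)
    labels=gold_multilabel,          # MultiLabel whose labels carry the gold documents
    add_isolated_node_eval=True,
)
print(results["answers"])            # predictions on retrieved documents ("integrated")
print(results["answers_isolated"])   # predictions on gold documents only ("isolated")
```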
