Merge branch 'main' into sentence-window-retrieval
davidsbatista committed Jun 21, 2024
2 parents 18328be + 8d87f9e commit 73b68fe
Showing 119 changed files with 3,853 additions and 258 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -162,3 +162,4 @@ cython_debug/
# MacOS
.DS_Store
*/.DS_Store
**/.DS_Store
10 changes: 8 additions & 2 deletions README.md
@@ -1,6 +1,12 @@
# haystack-evaluation

This repository contains examples on how to use Haystack to build different RAG architectures and evaluate their performance over different datasets.
This repository contains examples of how to use Haystack to evaluate systems, themselves built with Haystack, for
different tasks and datasets.

This repository is structured as:

- [Evaluations](evaluations/README.md)

- [Techniques/Architectures](evaluations/architectures/README.md)

- [RAG Techniques/Architectures](evaluations/architectures/README.md)
- [Datasets](datasets/README.md)
Binary file removed datasets/ARAGOG/.DS_Store
25 changes: 18 additions & 7 deletions datasets/README.md
@@ -1,8 +1,19 @@
# Datasets

## 1. ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037). It's a collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.
## Overview


Name      | Suitable Metrics | Description
----------|------------------|------------
ARAGOG    | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | A collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.
SQuAD 1.1 | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | A collection of questions and answers from Wikipedia articles, typically used for training and evaluating models for extractive question-answering tasks.
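
As a quick illustration, here is a minimal sketch of computing one of these metrics, Semantic Answer Similarity, on made-up answers (the model shown is the evaluator's default):

```python
from haystack.components.evaluators import SASEvaluator

# Semantic Answer Similarity scores predicted answers against ground-truth
# answers with an embedding model, returning values between 0 and 1.
sas = SASEvaluator(model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
sas.warm_up()

result = sas.run(
    ground_truth_answers=["The Transformer relies on self-attention."],
    predicted_answers=["Transformers use a self-attention mechanism."],
)
print(result["score"])              # aggregate SAS over all answer pairs
print(result["individual_scores"])  # one score per answer pair
```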


## ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037). It's a
collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.

The dataset contains:
- 13 PDF papers.
@@ -15,13 +26,13 @@ The following metrics can be used:



## 2. SQuAD dataset

The SQuAD 1.1 dataset is a collection of questions and answers from Wikipedia articles, and it's typically used for training and evaluating models for extractive question-answering tasks.
You can find more about this dataset on the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://aclanthology.org/D16-1264/) and on the official website:
[https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)

## SQuAD dataset

The SQuAD 1.1 dataset is a collection of questions and answers from Wikipedia articles, and it's typically used for
training and evaluating models for extractive question-answering tasks. You can find more about this dataset in the
paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://aclanthology.org/D16-1264/) and on the
official website: [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).
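
As a small illustration of two of the metrics suggested for SQuAD in the overview above, here is a minimal sketch on made-up SQuAD-style data (the passages and answers are placeholders):

```python
from haystack import Document
from haystack.components.evaluators import AnswerExactMatchEvaluator, DocumentMRREvaluator

# Exact Match is 1.0 only when the predicted answer matches the ground truth verbatim.
em = AnswerExactMatchEvaluator()
em_result = em.run(
    ground_truth_answers=["Denver Broncos"],
    predicted_answers=["Denver Broncos"],
)
print(em_result["score"])  # 1.0

# MRR rewards ranking the first relevant document as high as possible.
mrr = DocumentMRREvaluator()
mrr_result = mrr.run(
    ground_truth_documents=[[Document(content="The Denver Broncos won Super Bowl 50.")]],
    retrieved_documents=[[
        Document(content="An unrelated passage."),
        Document(content="The Denver Broncos won Super Bowl 50."),
    ]],
)
print(mrr_result["score"])  # 0.5 -- the first relevant document is at rank 2
```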

The dataset contains:
- 490 Wikipedia articles in text format.
12 changes: 7 additions & 5 deletions evaluations/README.md
@@ -1,7 +1,9 @@
# Evaluations

Name                                                          | Dataset | Evaluation Metrics | Colab
--------------------------------------------------------------|---------|--------------------|------
[RAG Evaluation](evaluation_aragog.py)                        | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | <a href="https://colab.research.google.com/github/deepset-ai/haystack-evaluation/blob/main/evaluations/evaluation_aragog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[RAG Evaluation](evaluation_squad_rag.py)                     | SQuAD   | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | ToDo
[Extractive QA Evaluation](evaluation_squad_extractive_qa.py) | SQuAD   | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | ToDo
Here we provide full examples of how to use Haystack to evaluate systems, themselves built with Haystack, for different tasks and datasets.

Name                                                                      | Dataset | Evaluation Metrics | Colab
--------------------------------------------------------------------------|---------|--------------------|------
[RAG with parameter search](evaluation_aragog.py)                         | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | <a href="https://colab.research.google.com/github/deepset-ai/haystack-evaluation/blob/main/evaluations/evaluation_aragog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[Baseline RAG vs HyDE using Harness](evaluation_aragog_harness.py)        | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | -
[Extractive QA with parameter search](evaluation_squad_extractive_qa.py)  | SQuAD   | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | -
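
For a flavour of the two LLM-based metrics used above in isolation, here is a minimal sketch with placeholder inputs (both evaluators call an OpenAI model by default, so an `OPENAI_API_KEY` is assumed to be set):

```python
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator

questions = ["Who created the Python programming language?"]
contexts = [["Python was created by Guido van Rossum and first released in 1991."]]
predicted_answers = ["Guido van Rossum created Python."]

# Context Relevance: is the retrieved context relevant to the question?
context_relevance = ContextRelevanceEvaluator()
print(context_relevance.run(questions=questions, contexts=contexts)["score"])

# Faithfulness: is the generated answer grounded in the retrieved context?
faithfulness = FaithfulnessEvaluator()
print(faithfulness.run(questions=questions, contexts=contexts,
                       predicted_answers=predicted_answers)["score"])
```
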
@@ -73,20 +73,20 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"id": "a03966eb-658d-4e16-bce0-e198886eca35",
"metadata": {
"id": "a03966eb-658d-4e16-bce0-e198886eca35"
},
"outputs": [],
"source": [
"import os\n",
"df = read_scores('results/results_aragog_2024_06_12/')"
"df = read_scores('results/aragog_parameter_search_2024_06_12/')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 6,
"id": "a018bfb3-755b-4a4f-9f2d-cf69201f9f6d",
"metadata": {
"colab": {
@@ -434,7 +434,7 @@
"26 3 256 "
]
},
"execution_count": 4,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -455,7 +455,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 7,
"id": "44d1fce4-430d-4365-b27e-d6e862eabc75",
"metadata": {},
"outputs": [
@@ -502,7 +502,7 @@
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 8,
"id": "8616c992-934a-414c-89e8-ea8ccad2408e",
"metadata": {},
"outputs": [],
@@ -512,7 +512,7 @@
},
{
"cell_type": "code",
"execution_count": 44,
"execution_count": 9,
"id": "b4328bc2-dccd-4a18-96d1-818df2d7e8d5",
"metadata": {},
"outputs": [
@@ -528,7 +528,7 @@
"Name: 1, dtype: object"
]
},
"execution_count": 44,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -539,7 +539,7 @@
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 10,
"id": "56d6327f-bec1-4fbe-a7fe-9ab4e51b4160",
"metadata": {},
"outputs": [
@@ -555,7 +555,7 @@
"Name: 0, dtype: object"
]
},
"execution_count": 45,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -574,7 +574,7 @@
},
{
"cell_type": "code",
"execution_count": 48,
"execution_count": 11,
"id": "dbc6831a-3eca-461c-8243-a2e9659fb220",
"metadata": {},
"outputs": [],
@@ -584,7 +584,7 @@
},
{
"cell_type": "code",
"execution_count": 49,
"execution_count": 12,
"id": "2c9b5ef4-d141-4219-9af8-6b86d3fbbb62",
"metadata": {},
"outputs": [
@@ -600,7 +600,7 @@
"Name: 17, dtype: object"
]
},
"execution_count": 49,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -611,7 +611,7 @@
},
{
"cell_type": "code",
"execution_count": 50,
"execution_count": 13,
"id": "f001f85f-2e95-43c9-b665-9a4aafa2b70f",
"metadata": {},
"outputs": [
@@ -627,7 +627,7 @@
"Name: 9, dtype: object"
]
},
"execution_count": 50,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@@ -646,7 +646,7 @@
},
{
"cell_type": "code",
"execution_count": 51,
"execution_count": 14,
"id": "d6105b7f-e1b3-4654-a39e-915797fa7c58",
"metadata": {},
"outputs": [],
@@ -656,7 +656,7 @@
},
{
"cell_type": "code",
"execution_count": 52,
"execution_count": 15,
"id": "00d06d75-9f43-40fb-85ec-c4fea16cb1e4",
"metadata": {},
"outputs": [
@@ -672,7 +672,7 @@
"Name: 26, dtype: object"
]
},
"execution_count": 52,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -683,7 +683,7 @@
},
{
"cell_type": "code",
"execution_count": 53,
"execution_count": 16,
"id": "6c6a51b5-6945-4832-b4dd-b273f8ee0fe9",
"metadata": {},
"outputs": [
@@ -699,7 +699,7 @@
"Name: 21, dtype: object"
]
},
"execution_count": 53,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -709,28 +709,28 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a5bb992-867c-41a6-bf25-b69645ad19b2",
"cell_type": "markdown",
"id": "32f2c613-361c-401e-9e41-17899b29eb6d",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## Let's inspect individual queries for this parameter configuration"
]
},
{
"cell_type": "code",
"execution_count": 55,
"execution_count": 19,
"id": "68e4ed5e-5afe-4db2-a93e-f4232e733092",
"metadata": {
"id": "68e4ed5e-5afe-4db2-a93e-f4232e733092"
},
"outputs": [],
"source": [
"detailed_best_sas_df = pd.read_csv(\"results/results_aragog_2024_06_12/detailed_msmarco-distilroberta-base-v2__top_k:3__chunk_size:128.csv\")"
"detailed_best_sas_df = pd.read_csv(\"results/aragog_parameter_search_2024_06_12/detailed_msmarco-distilroberta-base-v2__top_k:3__chunk_size:128.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"execution_count": 20,
"id": "c7f425f4-ed35-4625-8824-f06e33622eac",
"metadata": {
"id": "c7f425f4-ed35-4625-8824-f06e33622eac",
@@ -952,7 +952,7 @@
"[107 rows x 7 columns]"
]
},
"execution_count": 56,
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
@@ -963,7 +963,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 21,
"id": "0e2da651-7843-4dab-bc55-7f51d9965901",
"metadata": {
"id": "0e2da651-7843-4dab-bc55-7f51d9965901"
@@ -992,7 +992,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 22,
"id": "3d7e17ed-cbce-4dd6-a69d-ad217630fa23",
"metadata": {
"id": "3d7e17ed-cbce-4dd6-a69d-ad217630fa23",
@@ -1177,14 +1177,6 @@
"source": [
"inspect(44)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69641552-244b-46f3-832a-56aab8db3933",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
39 changes: 27 additions & 12 deletions evaluations/architectures/README.md
@@ -1,19 +1,39 @@
# RAG Techniques/Architectures

## Overview

## Basic RAG
Here we provide full examples of how to use Haystack to evaluate systems, themselves built with Haystack, for different tasks and datasets.

Name                                     | Code                                 | Description
-----------------------------------------|--------------------------------------|------------
Basic RAG                                | [basic_rag.py](basic_rag.py)         | Retrieves the top-_k_ document chunks and then passes them to an LLM to generate the answer.
Extractive QA                            | [extractive_qa.py](extractive_qa.py) | Retrieves the top-_k_ documents and uses an extractive QA model to extract the answer from the documents.
Hypothetical Document Embeddings (HyDE)  | [hyde_rag.py](hyde_rag.py)           | HyDE generates a hypothetical document from the query and uses it to retrieve similar documents from the document embedding space.
Sentence-Window Retrieval                | ToDo                                 | Breaks down documents into smaller chunks (sentences) and indexes them separately. Retrieves the most relevant sentences and replaces them with the full surrounding context.
Document Summary Index                   | ToDo                                 | ToDo
Multi-Query                              | ToDo                                 | ToDo
Maximal Marginal Relevance (MMR)         | ToDo                                 | ToDo
Cohere Re-ranker                         | ToDo                                 | ToDo
LLM-based Re-ranker                      | ToDo                                 | ToDo




### Basic RAG

This is the baseline RAG technique: it retrieves the top-_k_ document chunks and then uses them to generate the answer.
It uses the same text chunks for indexing/embedding as well as for generating answers.
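
A minimal sketch of what such a pipeline can look like in Haystack (the embedding model, generator, and prompt below are illustrative choices, not necessarily the configuration used in [basic_rag.py](basic_rag.py)):

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assumed to already hold embedded chunks

template = """Answer the question using only the given context.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("query_embedder", SentenceTransformersTextEmbedder())
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))  # needs OPENAI_API_KEY

rag.connect("query_embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

question = "What is a Transformer?"
result = rag.run({"query_embedder": {"text": question},
                  "prompt_builder": {"question": question}})
print(result["llm"]["replies"][0])
```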

---

## Extractive QA
### Extractive QA

This technique retrieves the top-_k_ documents, but instead of using the generator to generate the answer, it uses an
This technique retrieves the top-_k_ documents, but instead of using the generator to provide the answer, it uses an
extractive QA model to extract the answer from the retrieved documents.
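
A minimal sketch, assuming an in-memory store and a SQuAD-style reader model (the concrete components in [extractive_qa.py](extractive_qa.py) may differ):

```python
from haystack import Document, Pipeline
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="SQuAD was created by researchers at Stanford University."),
])

qa = Pipeline()
qa.add_component("retriever", InMemoryBM25Retriever(document_store, top_k=3))
# The reader extracts an answer span from the retrieved documents
# instead of generating free-form text.
qa.add_component("reader", ExtractiveReader(model="deepset/roberta-base-squad2"))
qa.connect("retriever.documents", "reader.documents")

query = "Who created SQuAD?"
result = qa.run({"retriever": {"query": query}, "reader": {"query": query}})
print(result["reader"]["answers"][0].data)  # highest-scoring extracted span
```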

---

## Hypothetical Document Embeddings (HyDE)
### Hypothetical Document Embeddings (HyDE)

HyDE first zero-shot prompts an instruction-following language model to generate a “fake” hypothetical document that
captures relevant textual patterns from the initial query - in practice, this is done five times.
@@ -25,16 +45,11 @@ retrieved based on vector similarity.
- Paper: [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://aclanthology.org/2023.acl-long.99.pdf)
- Blog: [HyDE: Hypothetical Document Embeddings for Zero-Shot Dense Retrieval](https://huggingface.co/blog/hyde-zero-shot-dense-retrieval)
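
A rough sketch of the idea (not the exact [hyde_rag.py](hyde_rag.py) implementation; the prompt, the models, and the plain averaging of the five embeddings are assumptions made for illustration):

```python
import numpy as np

from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assumed to already hold embedded documents

# n=5: ask the LLM for five hypothetical documents in one call.
generator = OpenAIGenerator(model="gpt-4o-mini",
                            generation_kwargs={"n": 5, "temperature": 0.75})
embedder = SentenceTransformersTextEmbedder()
embedder.warm_up()

query = "How does positional encoding work in Transformers?"

# 1. Zero-shot generate hypothetical documents that answer the query.
replies = generator.run(prompt=f"Write a short passage that answers: {query}")["replies"]

# 2. Embed each hypothetical document and average the vectors into a
#    single "hypothetical" query embedding.
vectors = [embedder.run(text=reply)["embedding"] for reply in replies]
hyde_embedding = np.mean(np.array(vectors), axis=0).tolist()

# 3. Retrieve real documents closest to the averaged embedding.
retriever = InMemoryEmbeddingRetriever(document_store, top_k=5)
docs = retriever.run(query_embedding=hyde_embedding)["documents"]
```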

---

## Sentence-Window Retrieval
### Sentence-Window Retrieval

The sentence-window approach breaks down documents into smaller chunks (sentences) and indexes them separately.

During retrieval, we retrieve the sentences that are most relevant to the query via similarity search and replace the
sentence with the full surrounding context, using a static sentence-window around the context.
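
Since the implementation is still marked as ToDo in the table above, here is only an illustrative sketch of the idea, with made-up sentences and a window of one sentence on each side (a real implementation would fetch the neighbours from the document store, e.g. via metadata filters):

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index each sentence as its own Document, remembering its position.
sentences = [
    "The Transformer dispenses with recurrence entirely.",
    "It relies on self-attention to model dependencies between tokens.",
    "Positional encodings inject information about token order.",
]
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content=s, meta={"doc_id": "attention_paper", "sent_idx": i})
    for i, s in enumerate(sentences)
])

retriever = InMemoryBM25Retriever(document_store, top_k=1)
hit = retriever.run(query="How does the Transformer model dependencies?")["documents"][0]

# Replace the matched sentence with a window of surrounding sentences.
window = 1
idx = hit.meta["sent_idx"]
context = " ".join(sentences[max(0, idx - window): idx + window + 1])
print(context)  # the matched sentence plus one neighbour on each side
```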

## Document Summary Index
## Multi-Query
## Maximal Marginal Relevance (MMR)
## Cohere Re-ranker
## LLM-based Re-ranker
