Merge branch 'main' into sentence-window-retrieval
davidsbatista committed Jun 21, 2024
2 parents 18328be + 8d87f9e commit 73b68fe
Showing 119 changed files with 3,853 additions and 258 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -162,3 +162,4 @@ cython_debug/
# MacOS
.DS_Store
*/.DS_Store
**/.DS_Store
10 changes: 8 additions & 2 deletions README.md
@@ -1,6 +1,12 @@
# haystack-evaluation

This repository contains examples on how to use Haystack to build different RAG architectures and evaluate their performance over different datasets.
This repository contains examples of how to use Haystack to evaluate systems, themselves built with Haystack, for
different tasks and datasets.

This repository is structured as:

- [Evaluations](evaluations/README.md)

- [Techniques/Architectures](evaluations/architectures/README.md)

- [RAG Techniques/Architectures](evaluations/architectures/README.md)
- [Datasets](datasets/README.md)
Binary file removed datasets/ARAGOG/.DS_Store
25 changes: 18 additions & 7 deletions datasets/README.md
@@ -1,8 +1,19 @@
# Datasets

## 1. ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037). It's a collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.
## Overview


Name      | Suitable Metrics | Description
----------|------------------|------------
ARAGOG    | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | A collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.
SQuAD 1.1 | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | A collection of questions and answers from Wikipedia articles, typically used for training and evaluating models for extractive question-answering tasks.
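
As a quick illustration, here is a minimal sketch of computing one of these metrics, Semantic Answer Similarity, on made-up answers (the model shown is the evaluator's default):

```python
from haystack.components.evaluators import SASEvaluator

# Semantic Answer Similarity scores predicted answers against ground-truth
# answers with an embedding model, returning values between 0 and 1.
sas = SASEvaluator(model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
sas.warm_up()

result = sas.run(
    ground_truth_answers=["The Transformer relies on self-attention."],
    predicted_answers=["Transformers use a self-attention mechanism."],
)
print(result["score"])              # aggregate SAS over all answer pairs
print(result["individual_scores"])  # one score per answer pair
```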


## ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037). It's a
collection of papers from ArXiv covering topics around Transformers and Large Language Models, all in PDF format.

The dataset contains:
- 13 PDF papers.
@@ -15,13 +26,13 @@ The following metrics can be used:



## 2. SQuAD dataset

The SQuAD 1.1 dataset is a collection of questions and answers from Wikipedia articles, and it's typically used for training and evaluating models for extractive question-answering tasks.
You can find more about this dataset on the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://aclanthology.org/D16-1264/) and on the official website:
[https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)

## SQuAD dataset

The SQuAD 1.1 dataset is a collection of questions and answers from Wikipedia articles, and it's typically used for
training and evaluating models for extractive question-answering tasks. You can find more about this dataset in the
paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://aclanthology.org/D16-1264/) and on the
official website: [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).
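
As a small illustration of two of the metrics suggested for SQuAD in the overview above, here is a minimal sketch on made-up SQuAD-style data (the passages and answers are placeholders):

```python
from haystack import Document
from haystack.components.evaluators import AnswerExactMatchEvaluator, DocumentMRREvaluator

# Exact Match is 1.0 only when the predicted answer matches the ground truth verbatim.
em = AnswerExactMatchEvaluator()
em_result = em.run(
    ground_truth_answers=["Denver Broncos"],
    predicted_answers=["Denver Broncos"],
)
print(em_result["score"])  # 1.0

# MRR rewards ranking the first relevant document as high as possible.
mrr = DocumentMRREvaluator()
mrr_result = mrr.run(
    ground_truth_documents=[[Document(content="The Denver Broncos won Super Bowl 50.")]],
    retrieved_documents=[[
        Document(content="An unrelated passage."),
        Document(content="The Denver Broncos won Super Bowl 50."),
    ]],
)
print(mrr_result["score"])  # 0.5 -- the first relevant document is at rank 2
```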

The dataset contains:
- 490 Wikipedia articles in text format.
12 changes: 7 additions & 5 deletions evaluations/README.md
@@ -1,7 +1,9 @@
# Evaluations

Name                                                          | Dataset | Evaluation Metrics | Colab
--------------------------------------------------------------|---------|--------------------|------
[RAG Evaluation](evaluation_aragog.py)                        | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | <a href="https://colab.research.google.com/github/deepset-ai/haystack-evaluation/blob/main/evaluations/evaluation_aragog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[RAG Evaluation](evaluation_squad_rag.py)                     | SQuAD   | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | ToDo
[Extractive QA Evaluation](evaluation_squad_extractive_qa.py) | SQuAD   | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | ToDo
Here we provide full examples of how to use Haystack to evaluate systems, themselves built with Haystack, for different tasks and datasets.

Name                                                                      | Dataset | Evaluation Metrics | Colab
--------------------------------------------------------------------------|---------|--------------------|------
[RAG with parameter search](evaluation_aragog.py)                         | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | <a href="https://colab.research.google.com/github/deepset-ai/haystack-evaluation/blob/main/evaluations/evaluation_aragog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
[Baseline RAG vs HyDE using Harness](evaluation_aragog_harness.py)        | ARAGOG  | [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator), [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | -
[Extractive QA with parameter search](evaluation_squad_extractive_qa.py)  | SQuAD   | [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator), [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator) | -
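
For a flavour of the two LLM-based metrics used above in isolation, here is a minimal sketch with placeholder inputs (both evaluators call an OpenAI model by default, so an `OPENAI_API_KEY` is assumed to be set):

```python
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator

questions = ["Who created the Python programming language?"]
contexts = [["Python was created by Guido van Rossum and first released in 1991."]]
predicted_answers = ["Guido van Rossum created Python."]

# Context Relevance: is the retrieved context relevant to the question?
context_relevance = ContextRelevanceEvaluator()
print(context_relevance.run(questions=questions, contexts=contexts)["score"])

# Faithfulness: is the generated answer grounded in the retrieved context?
faithfulness = FaithfulnessEvaluator()
print(faithfulness.run(questions=questions, contexts=contexts,
                       predicted_answers=predicted_answers)["score"])
```
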
@@ -73,20 +73,20 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"id": "a03966eb-658d-4e16-bce0-e198886eca35",
"metadata": {
"id": "a03966eb-658d-4e16-bce0-e198886eca35"
},
"outputs": [],
"source": [
"import os\n",
"df = read_scores('results/results_aragog_2024_06_12/')"
"df = read_scores('results/aragog_parameter_search_2024_06_12/')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 6,
"id": "a018bfb3-755b-4a4f-9f2d-cf69201f9f6d",
"metadata": {
"colab": {
@@ -434,7 +434,7 @@
"26 3 256 "
]
},
"execution_count": 4,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -455,7 +455,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 7,
"id": "44d1fce4-430d-4365-b27e-d6e862eabc75",
"metadata": {},
"outputs": [
@@ -502,7 +502,7 @@
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 8,
"id": "8616c992-934a-414c-89e8-ea8ccad2408e",
"metadata": {},
"outputs": [],
@@ -512,7 +512,7 @@
},
{
"cell_type": "code",
"execution_count": 44,
"execution_count": 9,
"id": "b4328bc2-dccd-4a18-96d1-818df2d7e8d5",
"metadata": {},
"outputs": [
@@ -528,7 +528,7 @@
"Name: 1, dtype: object"
]
},
"execution_count": 44,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -539,7 +539,7 @@
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 10,
"id": "56d6327f-bec1-4fbe-a7fe-9ab4e51b4160",
"metadata": {},
"outputs": [
@@ -555,7 +555,7 @@
"Name: 0, dtype: object"
]
},
"execution_count": 45,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -574,7 +574,7 @@
},
{
"cell_type": "code",
"execution_count": 48,
"execution_count": 11,
"id": "dbc6831a-3eca-461c-8243-a2e9659fb220",
"metadata": {},
"outputs": [],
@@ -584,7 +584,7 @@
},
{
"cell_type": "code",
"execution_count": 49,
"execution_count": 12,
"id": "2c9b5ef4-d141-4219-9af8-6b86d3fbbb62",
"metadata": {},
"outputs": [
@@ -600,7 +600,7 @@
"Name: 17, dtype: object"
]
},
"execution_count": 49,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -611,7 +611,7 @@
},
{
"cell_type": "code",
"execution_count": 50,
"execution_count": 13,
"id": "f001f85f-2e95-43c9-b665-9a4aafa2b70f",
"metadata": {},
"outputs": [
@@ -627,7 +627,7 @@
"Name: 9, dtype: object"
]
},
"execution_count": 50,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@@ -646,7 +646,7 @@
},
{
"cell_type": "code",
"execution_count": 51,
"execution_count": 14,
"id": "d6105b7f-e1b3-4654-a39e-915797fa7c58",
"metadata": {},
"outputs": [],
@@ -656,7 +656,7 @@
},
{
"cell_type": "code",
"execution_count": 52,
"execution_count": 15,
"id": "00d06d75-9f43-40fb-85ec-c4fea16cb1e4",
"metadata": {},
"outputs": [
@@ -672,7 +672,7 @@
"Name: 26, dtype: object"
]
},
"execution_count": 52,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -683,7 +683,7 @@
},
{
"cell_type": "code",
"execution_count": 53,
"execution_count": 16,
"id": "6c6a51b5-6945-4832-b4dd-b273f8ee0fe9",
"metadata": {},
"outputs": [
@@ -699,7 +699,7 @@
"Name: 21, dtype: object"
]
},
"execution_count": 53,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -709,28 +709,28 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a5bb992-867c-41a6-bf25-b69645ad19b2",
"cell_type": "markdown",
"id": "32f2c613-361c-401e-9e41-17899b29eb6d",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## Let's inspect individual queries for this parameter configuration"
]
},
{
"cell_type": "code",
"execution_count": 55,
"execution_count": 19,
"id": "68e4ed5e-5afe-4db2-a93e-f4232e733092",
"metadata": {
"id": "68e4ed5e-5afe-4db2-a93e-f4232e733092"
},
"outputs": [],
"source": [
"detailed_best_sas_df = pd.read_csv(\"results/results_aragog_2024_06_12/detailed_msmarco-distilroberta-base-v2__top_k:3__chunk_size:128.csv\")"
"detailed_best_sas_df = pd.read_csv(\"results/aragog_parameter_search_2024_06_12/detailed_msmarco-distilroberta-base-v2__top_k:3__chunk_size:128.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"execution_count": 20,
"id": "c7f425f4-ed35-4625-8824-f06e33622eac",
"metadata": {
"id": "c7f425f4-ed35-4625-8824-f06e33622eac",
@@ -952,7 +952,7 @@
"[107 rows x 7 columns]"
]
},
"execution_count": 56,
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
@@ -963,7 +963,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 21,
"id": "0e2da651-7843-4dab-bc55-7f51d9965901",
"metadata": {
"id": "0e2da651-7843-4dab-bc55-7f51d9965901"
@@ -992,7 +992,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 22,
"id": "3d7e17ed-cbce-4dd6-a69d-ad217630fa23",
"metadata": {
"id": "3d7e17ed-cbce-4dd6-a69d-ad217630fa23",
@@ -1177,14 +1177,6 @@
"source": [
"inspect(44)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69641552-244b-46f3-832a-56aab8db3933",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
39 changes: 27 additions & 12 deletions evaluations/architectures/README.md
@@ -1,19 +1,39 @@
# RAG Techniques/Architectures

## Overview

## Basic RAG
Here we provide full examples of how to use Haystack to evaluate systems, themselves built with Haystack, for different tasks and datasets.

Name                                     | Code                                 | Description
-----------------------------------------|--------------------------------------|------------
Basic RAG                                | [basic_rag.py](basic_rag.py)         | Retrieves the top-_k_ document chunks and then passes them to an LLM to generate the answer.
Extractive QA                            | [extractive_qa.py](extractive_qa.py) | Retrieves the top-_k_ documents and uses an extractive QA model to extract the answer from the documents.
Hypothetical Document Embeddings (HyDE)  | [hyde_rag.py](hyde_rag.py)           | HyDE generates a hypothetical document from the query and uses it to retrieve similar documents from the document embedding space.
Sentence-Window Retrieval                | ToDo                                 | Breaks down documents into smaller chunks (sentences) and indexes them separately. Retrieves the most relevant sentences and replaces them with the full surrounding context.
Document Summary Index                   | ToDo                                 | ToDo
Multi-Query                              | ToDo                                 | ToDo
Maximal Marginal Relevance (MMR)         | ToDo                                 | ToDo
Cohere Re-ranker                         | ToDo                                 | ToDo
LLM-based Re-ranker                      | ToDo                                 | ToDo




### Basic RAG

This is the baseline RAG technique: it retrieves the top-_k_ document chunks and then uses them to generate the answer.
It uses the same text chunks for indexing/embedding as well as for generating answers.
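
A minimal sketch of what such a pipeline can look like in Haystack (the embedding model, generator, and prompt below are illustrative choices, not necessarily the configuration used in [basic_rag.py](basic_rag.py)):

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assumed to already hold embedded chunks

template = """Answer the question using only the given context.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("query_embedder", SentenceTransformersTextEmbedder())
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))  # needs OPENAI_API_KEY

rag.connect("query_embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

question = "What is a Transformer?"
result = rag.run({"query_embedder": {"text": question},
                  "prompt_builder": {"question": question}})
print(result["llm"]["replies"][0])
```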

---

## Extractive QA
### Extractive QA

This technique retrieves the top-_k_ documents, but instead of using the generator to generate the answer, it uses an
This technique retrieves the top-_k_ documents, but instead of using the generator to provide the answer, it uses an
extractive QA model to extract the answer from the retrieved documents.
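
A minimal sketch, assuming an in-memory store and a SQuAD-style reader model (the concrete components in [extractive_qa.py](extractive_qa.py) may differ):

```python
from haystack import Document, Pipeline
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="SQuAD was created by researchers at Stanford University."),
])

qa = Pipeline()
qa.add_component("retriever", InMemoryBM25Retriever(document_store, top_k=3))
# The reader extracts an answer span from the retrieved documents
# instead of generating free-form text.
qa.add_component("reader", ExtractiveReader(model="deepset/roberta-base-squad2"))
qa.connect("retriever.documents", "reader.documents")

query = "Who created SQuAD?"
result = qa.run({"retriever": {"query": query}, "reader": {"query": query}})
print(result["reader"]["answers"][0].data)  # highest-scoring extracted span
```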

---

## Hypothetical Document Embeddings (HyDE)
### Hypothetical Document Embeddings (HyDE)

HyDE first zero-shot prompts an instruction-following language model to generate a “fake” hypothetical document that
captures relevant textual patterns from the initial query - in practice, this is done five times.
@@ -25,16 +45,11 @@ retrieved based on vector similarity.
- Paper: [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://aclanthology.org/2023.acl-long.99.pdf)
- Blog: [HyDE: Hypothetical Document Embeddings for Zero-Shot Dense Retrieval](https://huggingface.co/blog/hyde-zero-shot-dense-retrieval)
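
A rough sketch of the idea (not the exact [hyde_rag.py](hyde_rag.py) implementation; the prompt, the models, and the plain averaging of the five embeddings are assumptions made for illustration):

```python
import numpy as np

from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assumed to already hold embedded documents

# n=5: ask the LLM for five hypothetical documents in one call.
generator = OpenAIGenerator(model="gpt-4o-mini",
                            generation_kwargs={"n": 5, "temperature": 0.75})
embedder = SentenceTransformersTextEmbedder()
embedder.warm_up()

query = "How does positional encoding work in Transformers?"

# 1. Zero-shot generate hypothetical documents that answer the query.
replies = generator.run(prompt=f"Write a short passage that answers: {query}")["replies"]

# 2. Embed each hypothetical document and average the vectors into a
#    single "hypothetical" query embedding.
vectors = [embedder.run(text=reply)["embedding"] for reply in replies]
hyde_embedding = np.mean(np.array(vectors), axis=0).tolist()

# 3. Retrieve real documents closest to the averaged embedding.
retriever = InMemoryEmbeddingRetriever(document_store, top_k=5)
docs = retriever.run(query_embedding=hyde_embedding)["documents"]
```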

---

## Sentence-Window Retrieval
### Sentence-Window Retrieval

The sentence-window approach breaks down documents into smaller chunks (sentences) and indexes them separately.

During retrieval, we retrieve the sentences that are most relevant to the query via similarity search and replace the
sentence with the full surrounding context, using a static sentence-window around the context.
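
Since the implementation is still marked as ToDo in the table above, here is only an illustrative sketch of the idea, with made-up sentences and a window of one sentence on each side (a real implementation would fetch the neighbours from the document store, e.g. via metadata filters):

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index each sentence as its own Document, remembering its position.
sentences = [
    "The Transformer dispenses with recurrence entirely.",
    "It relies on self-attention to model dependencies between tokens.",
    "Positional encodings inject information about token order.",
]
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content=s, meta={"doc_id": "attention_paper", "sent_idx": i})
    for i, s in enumerate(sentences)
])

retriever = InMemoryBM25Retriever(document_store, top_k=1)
hit = retriever.run(query="How does the Transformer model dependencies?")["documents"][0]

# Replace the matched sentence with a window of surrounding sentences.
window = 1
idx = hit.meta["sent_idx"]
context = " ".join(sentences[max(0, idx - window): idx + window + 1])
print(context)  # the matched sentence plus one neighbour on each side
```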

## Document Summary Index
## Multi-Query
## Maximal Marginal Relevance (MMR)
## Cohere Re-ranker
## LLM-based Re-ranker
