Commit

adding results
davidsbatista committed May 30, 2024
1 parent b0dea0b commit 993514a
Showing 174 changed files with 9,279 additions and 8 deletions.
47 changes: 47 additions & 0 deletions evaluations/README.md
@@ -0,0 +1,47 @@
# Evaluations

| Dataset and Evaluation | Colab |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RAG over ARAGOG dataset | <a href="https://colab.research.google.com/github/deepset-ai/haystack-evaluation/blob/main/evaluations/evaluation_aragog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |



## ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037).
It is a collection of arXiv papers on Transformers and Large Language Models, all in PDF format.

The dataset contains:
- 13 PDF papers
- 107 questions and answers, generated with the assistance of GPT-4 and validated/corrected by humans

It has human annotations for the following metrics:
- [ContextRelevance](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator)
- [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator)
- [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator)

Check the [RAG over ARAGOG dataset notebook](aragog_evaluation.ipynb) for an example.
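
For a quick orientation, here is a minimal sketch of how these three metrics can be computed with Haystack's evaluator components (this is not the notebook's exact code): the questions, contexts, and answers are made-up placeholders, the SAS model choice is arbitrary, and the two LLM-based evaluators assume an OpenAI API key in `OPENAI_API_KEY`.

```python
from haystack.components.evaluators import (
    ContextRelevanceEvaluator,
    FaithfulnessEvaluator,
    SASEvaluator,
)

# Placeholder inputs; in the notebook they come from the ARAGOG ground truth
# and from running the RAG pipeline over the indexed papers.
questions = ["What is the purpose of positional encodings in the Transformer?"]
ground_truth_answers = ["They inject information about the order of the tokens."]
retrieved_contexts = [["Since the model contains no recurrence, positional encodings are added ..."]]
predicted_answers = ["They give the model information about token order."]

# LLM-based evaluators (use OpenAI by default and read OPENAI_API_KEY from the environment).
context_relevance = ContextRelevanceEvaluator()
faithfulness = FaithfulnessEvaluator()

# Embedding-based evaluator; warm_up() loads the sentence-transformers model.
sas = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
sas.warm_up()

results = {
    "context_relevance": context_relevance.run(questions=questions, contexts=retrieved_contexts),
    "faithfulness": faithfulness.run(
        questions=questions, contexts=retrieved_contexts, predicted_answers=predicted_answers
    ),
    "sas": sas.run(ground_truth_answers=ground_truth_answers, predicted_answers=predicted_answers),
}

for name, result in results.items():
    print(name, result["score"])  # each evaluator also returns per-sample "individual_scores"
```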


---

## SQuAD dataset

SQuAD (Stanford Question Answering Dataset) is a collection of questions and answers based on Wikipedia articles.
It is typically used for training and evaluating models for extractive question answering.

The dataset contains:
- 490 Wikipedia articles in text format
- 98k questions whose answers are spans in the articles

It contains human annotations suitable for the following metrics:
- [Answer Exact Match](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator)
- [DocumentMRR](https://docs.haystack.deepset.ai/docs/documentmrrevaluator)
- [DocumentMAP](https://docs.haystack.deepset.ai/docs/documentmapevaluator)
- [DocumentRecall](https://docs.haystack.deepset.ai/docs/documentrecallevaluator)
- [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator)


Check the [RAG over SQuAD notebook](squad_rag_evaluation.ipynb) and the [Extractive QA over SQuAD notebook](squad_extractive_qa_evaluation.ipynb) for examples.
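
As a rough sketch (again, not the notebooks' exact code), the document- and answer-level metrics above can be computed directly from the ground truth and the pipeline outputs with Haystack's evaluator components; the documents and answers below are placeholders.

```python
from haystack import Document
from haystack.components.evaluators import (
    AnswerExactMatchEvaluator,
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
)

# Placeholders standing in for the SQuAD annotations and the retriever/reader outputs.
ground_truth_docs = [[Document(content="Normandy is a region in France.")]]
retrieved_docs = [[
    Document(content="Normandy is a region in France."),
    Document(content="The Normans were originally Norse raiders."),
]]
ground_truth_answers = ["France"]
predicted_answers = ["France"]

document_evaluators = {
    "doc_mrr": DocumentMRREvaluator(),
    "doc_map": DocumentMAPEvaluator(),
    "doc_recall": DocumentRecallEvaluator(),  # single-hit recall by default
}
for name, evaluator in document_evaluators.items():
    result = evaluator.run(
        ground_truth_documents=ground_truth_docs, retrieved_documents=retrieved_docs
    )
    print(name, result["score"])

exact_match = AnswerExactMatchEvaluator().run(
    ground_truth_answers=ground_truth_answers, predicted_answers=predicted_answers
)
print("exact_match", exact_match["score"])
```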
12 changes: 6 additions & 6 deletions evaluation_arago.py → evaluations/evaluation_aragog.py
@@ -110,19 +110,19 @@ def parameter_tuning(questions, answers):
     out_path.mkdir(exist_ok=True)
 
     for embedding_model in embedding_models:
-        for top_k in top_k_values:
-            for chunk_size in chunk_sizes:
+        for chunk_size in chunk_sizes:
+            print("Indexing documents")
+            doc_store = indexing(embedding_model, chunk_size)
+            for top_k in top_k_values:
                 name_params = f"{embedding_model.split('/')[-1]}__top_k:{top_k}__chunk_size:{chunk_size}"
                 print(name_params)
-                print("Indexing documents")
-                doc_store = indexing(embedding_model, chunk_size)
                 print("Running RAG pipeline")
                 retrieved_contexts, predicted_answers = run_basic_rag(doc_store, questions, embedding_model, top_k)
                 print(f"Running evaluation")
                 results, inputs = run_evaluation(questions, answers, retrieved_contexts, predicted_answers, embedding_model)
                 eval_results = EvaluationRunResult(run_name=name_params, inputs=inputs, results=results)
-                eval_results.score_report().to_csv(f"{out_path}/score_report_{name_params}.csv")
-                eval_results.to_pandas().to_csv(f"{out_path}/detailed_{name_params}.csv")
+                eval_results.score_report().to_csv(f"{out_path}/score_report_{name_params}.csv", index=False)
+                eval_results.to_pandas().to_csv(f"{out_path}/detailed_{name_params}.csv", index=False)
 
 
 def main():
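
The refactored loop above builds each document store once per embedding model and chunk size (instead of rebuilding it for every top_k value) and writes one score report CSV per parameter combination, now without the pandas index column. As a hypothetical follow-up, not part of this commit and assuming a `results/` output directory, the reports could be collected into a single comparison table:

```python
from pathlib import Path

import pandas as pd

# Assumed location of the score_report_*.csv files written by parameter_tuning();
# the exact columns depend on EvaluationRunResult.score_report().
out_path = Path("results")

reports = []
for csv_file in sorted(out_path.glob("score_report_*.csv")):
    df = pd.read_csv(csv_file)
    # Remember which parameter combination produced this report.
    df["run"] = csv_file.stem.removeprefix("score_report_")
    reports.append(df)

comparison = pd.concat(reports, ignore_index=True)
print(comparison)
```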
@@ -77,13 +77,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 4,
    "id": "c79d6283-c639-474e-bb19-0be485ce1df9",
    "metadata": {},
    "outputs": [],
    "source": [
     "from tqdm import tqdm\n",
-    "from architectures.extractive_qa import get_extractive_qa_pipeline\n",
+    "from architectures.basic_rag import basic_rag\n",
     "\n",
     "def run_extractive_qa(doc_store, questions, embedding_model, top_k_retriever):\n",
     "\n",
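
The notebook hunk above edits the cell that defines `run_extractive_qa`, the helper that answers questions with spans extracted from retrieved documents. Independent of the repo's `architectures` module, a self-contained sketch of span extraction with Haystack's `ExtractiveReader` looks roughly like this; the model name and documents are only examples.

```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

# Example documents standing in for whatever the retriever returns.
docs = [
    Document(content="The Eiffel Tower is located in Paris, France."),
    Document(content="The Colosseum is located in Rome, Italy."),
]

# Any extractive QA model from the Hugging Face Hub works here.
reader = ExtractiveReader(model="deepset/roberta-base-squad2")
reader.warm_up()

result = reader.run(query="Where is the Eiffel Tower?", documents=docs, top_k=2)
for answer in result["answers"]:
    # Each answer carries the extracted span (data) and a confidence score;
    # the reader may also include a "no answer" entry with data=None.
    print(answer.data, answer.score)
```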
