haystack-evaluation

This repository contains examples of how to use Haystack to build different RAG architectures and evaluate their performance over different datasets.
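As a rough sketch of the kind of pipeline these examples build and evaluate, here is a minimal BM25-based RAG pipeline assuming Haystack 2.x; the document, prompt template, and question are placeholders, and OpenAIGenerator expects an OPENAI_API_KEY environment variable.

```python
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index a toy document; the notebooks index the dataset PDFs/articles instead.
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content="Transformers rely on self-attention.")])

template = """Answer the question using only the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("generator", OpenAIGenerator())  # reads OPENAI_API_KEY
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "generator.prompt")

question = "What do Transformers rely on?"
result = rag.run({"retriever": {"query": question},
                  "prompt_builder": {"question": question}})
print(result["generator"]["replies"][0])
```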


Evaluations

ARAGOG

This dataset is based on the paper Advanced Retrieval Augmented Generation Output Grading (ARAGOG). It's a collection of arXiv papers on Transformers and Large Language Models, all in PDF format.

The dataset contains:

  • 13 PDF papers
  • 107 questions and answers generated with the assistance of GPT-4 and validated/corrected by humans

The human-validated answers make the dataset suitable for metrics such as semantic answer similarity, context relevance, and faithfulness.
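As a sketch of how these annotations can be scored, the snippet below runs two Haystack 2.x evaluator components: SASEvaluator compares predicted answers to the human-validated ground truth with an embedding model, and FaithfulnessEvaluator uses an LLM (OpenAI by default, so OPENAI_API_KEY must be set) to check that answers are grounded in the retrieved contexts. The questions, contexts, and answers are illustrative stand-ins for the ARAGOG data.

```python
from haystack.components.evaluators import FaithfulnessEvaluator, SASEvaluator

# Stand-in data; in the notebooks these come from the ARAGOG questions,
# the retrieved contexts, and the generated answers.
questions = ["What mechanism do Transformers rely on?"]
contexts = [["Transformers rely on self-attention to weigh input tokens."]]
predicted_answers = ["They rely on self-attention."]
ground_truth_answers = ["Transformers rely on the self-attention mechanism."]

# Semantic Answer Similarity: embedding-based comparison with ground truth.
sas = SASEvaluator()
sas.warm_up()  # loads the sentence-transformers model
sas_result = sas.run(ground_truth_answers=ground_truth_answers,
                     predicted_answers=predicted_answers)
print(sas_result["score"])

# Faithfulness: LLM-judged groundedness of the answer in the contexts.
faithfulness = FaithfulnessEvaluator()
faith_result = faithfulness.run(questions=questions, contexts=contexts,
                                predicted_answers=predicted_answers)
print(faith_result["score"])
```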

Check the RAG over ARAGOG dataset notebook for an example.


SQuAD dataset

The SQuAD dataset (Stanford Question Answering Dataset) is a collection of questions and answers based on Wikipedia articles, typically used for training and evaluating extractive question-answering models.

The dataset contains:

  • 490 Wikipedia articles in text format
  • 98k questions whose answers are spans in the articles

It contains human annotations suitable for metrics such as answer exact match and semantic answer similarity.

Check the RAG over SQuAD notebook and the Extractive QA over SQuAD notebook for examples.
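Since SQuAD answers are exact spans from the articles, exact match is a natural metric; below is a minimal sketch using Haystack 2.x's AnswerExactMatchEvaluator, with illustrative strings standing in for real predictions.

```python
from haystack.components.evaluators import AnswerExactMatchEvaluator

evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=["Denver Broncos", "Santa Clara"],
    predicted_answers=["Denver Broncos", "Levi's Stadium"],
)
print(result["individual_scores"])  # [1, 0]: per-question exact-match flags
print(result["score"])              # 0.5: fraction of exact matches
```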
