We provide scripts for running some well-known benchmarks with fastRAG. The goal is to showcase common use cases, such as those that include BM25 retrievers, Sentence BERT re-rankers, ColBERT retrievers, and a FiD generative reader. The scripts can also serve as a starting point for benchmarking new and modified workflows with new components.
We report results and compare them with those of the original benchmark papers.
BeIR contains diverse IR tasks (Thakur et al. 2021); here we focus on MSMARCO and NaturalQuestions (NQ). For each task, we test two pipelines:
- BM25 retriever with a Sentence BERT cross-encoder re-ranker (a minimal sketch of this retrieve-and-re-rank flow follows the list); scripts: MSMARCO, NQ.
- ColBERT v2 retriever using a PLAID index; scripts: MSMARCO, NQ.
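Below is a minimal sketch of the first pipeline, assuming Haystack 1.x-style components (which fastRAG builds on); the in-memory document store, the cross-encoder model name, the toy documents, and the `top_k` values are illustrative choices and may differ from what the benchmark scripts actually configure.

```python
# Sketch of a BM25 retrieve + cross-encoder re-rank pipeline (Haystack 1.x style).
# Not the benchmark script itself; documents, model, and top_k values are illustrative.
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, SentenceTransformersRanker

# Toy corpus; the benchmarks index the full MSMARCO / NQ collections instead.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "MS MARCO is a large-scale passage ranking collection.", "meta": {"id": "d1"}},
    {"content": "Natural Questions contains real user queries issued to Google.", "meta": {"id": "d2"}},
])

retriever = BM25Retriever(document_store=document_store)
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2")

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=ranker, name="Ranker", inputs=["Retriever"])

# Retrieve a wide candidate set with BM25, then let the cross-encoder re-rank the top results.
result = pipeline.run(
    query="what is ms marco?",
    params={"Retriever": {"top_k": 100}, "Ranker": {"top_k": 10}},
)
print([doc.content for doc in result["documents"]])
```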
| | nDCG@10 | nDCG@20 | nDCG@50 | nDCG@100 | MAP@10 | MAP@20 | MAP@50 | MAP@100 | R@10 | R@20 | R@50 | R@100 | P@10 | P@20 | P@50 | P@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **MSMARCO** | | | | | | | | | | | | | | | | |
| BM25+SBERT | .4007 | .4111 | .4159 | .4163 | .3485 | .3516 | .3524 | .3525 | .5567 | .5962 | .6190 | .6212 | .0582 | .0312 | .0130 | .0065 |
| PLAID | .4610 | .4857 | .5030 | .5103 | .3889 | .3960 | .3990 | .3997 | .6815 | .7765 | .8608 | .9046 | .0712 | .0408 | .0182 | .0096 |
| **NQ** | | | | | | | | | | | | | | | | |
| BM25+SBERT | .4732 | .4871 | .4971 | .5002 | .4191 | .4234 | .4253 | .4257 | .6111 | .6631 | .7098 | .7279 | .0725 | .0396 | .0170 | .0087 |
| PLAID | .5411 | .5634 | .5781 | .5828 | .4700 | .4770 | .4797 | .4802 | .7327 | .8157 | .8843 | .9112 | .0871 | .0489 | .0213 | .0110 |
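For reference, the cutoff metrics in the table (nDCG@k, MAP@k, R@k, P@k) can be computed with the evaluator from the beir package; the qrels and run below are toy stand-ins rather than real MSMARCO or NQ data, and the exact signature and return keys may vary slightly across beir versions.

```python
# Sketch of computing nDCG/MAP/Recall/Precision at cutoffs with beir.
# qrels/results are toy stand-ins, not the actual benchmark data.
from beir.retrieval.evaluation import EvaluateRetrieval

# qrels: query_id -> {doc_id: relevance}; results: query_id -> {doc_id: retrieval score}
qrels = {"q1": {"d1": 1}}
results = {"q1": {"d1": 12.3, "d2": 7.1}}

ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(
    qrels, results, k_values=[10, 20, 50, 100]
)
print(ndcg["NDCG@10"], _map["MAP@10"], recall["Recall@10"], precision["P@10"])
```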
KILT contains multiple open-domain tasks, including Question Answering, Slot Filling, Fact Checking, and more (Petroni et al. 2021). For evaluation, we primarily showcase its version of NaturalQuestions (NQ). This version differs from the one in BeIR in that it measures the downstream performance of the model, not just retrieval quality. The pipelines used here are essentially the same as in BeIR, with the addition of a FiD model as a reader, which generates the final answer (the fusion-in-decoder reading step is sketched after the list below). The pipelines are:
- BM25 retriever with a Sentence BERT cross-encoder re-ranker; scripts: NQ.
- ColBERT v2 retriever using a PLAID index; scripts: NQ.
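The reading step follows the fusion-in-decoder pattern: the question is paired with each retrieved passage, the pairs are encoded independently, and the decoder attends over all encodings at once to generate the answer. The sketch below illustrates that mechanism with a plain `t5-base` checkpoint from Hugging Face transformers; it is not the fastRAG FiD component, and a checkpoint actually fine-tuned for NQ would be needed to produce sensible answers.

```python
# Illustrative fusion-in-decoder forward pass: encode (question, passage) pairs
# separately, concatenate the encoder states, and decode over the fused sequence.
# Uses a plain t5-base checkpoint only to show the mechanism.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "who wrote the declaration of independence?"
passages = [
    "The Declaration of Independence was drafted by Thomas Jefferson.",
    "The document was adopted by the Second Continental Congress in 1776.",
]

# Encode each (question, passage) pair independently.
inputs = tokenizer(
    [f"question: {question} context: {p}" for p in passages],
    return_tensors="pt", padding=True, truncation=True, max_length=256,
)
encoder_states = model.encoder(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
).last_hidden_state  # shape: (n_passages, seq_len, hidden)

# Fuse: flatten all passage encodings into one long sequence for the decoder.
fused = encoder_states.reshape(1, -1, encoder_states.size(-1))
fused_mask = inputs["attention_mask"].reshape(1, -1)

answer_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=fused_mask,
    max_length=32,
)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```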
The following are the performance results of the pipelines:
| | Accuracy | EM | F1 | Rouge-L |
|---|---|---|---|---|
| **Natural Questions** | | | | |
| BM25+SBERT | 41.45 | 46.88 | 55.60 | 55.60 |
| PLAID | 46.14 | 53.29 | 62.31 | 58.78 |
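For context, EM and F1 here follow the usual SQuAD-style answer normalization (lowercasing, stripping punctuation, articles, and extra whitespace) before comparing the generated answer with the gold answers. The snippet below is an independent illustration of those two metrics, not the official KILT evaluation code.

```python
# Compact, self-contained EM and token-level F1 with SQuAD-style normalization.
# Illustrative reimplementation, not the official KILT evaluation script.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Thomas Jefferson", "thomas jefferson."))  # -> 1.0
print(f1("Jefferson wrote it", "Thomas Jefferson"))          # -> 0.4
```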
Thakur, Nandan, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=wCu6T5xFjeJ.
Petroni, Fabio, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. “KILT: A Benchmark for Knowledge Intensive Language Tasks.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2523–2544. Online: Association for Computational Linguistics. https://aclanthology.org/2021.naacl-main.200.pdf.