
find an industry dataset to showcase evaluation metrics #7438

Closed · Tracked by #7407
mrm1001 opened this issue Mar 28, 2024 · 7 comments

Labels: P1 (High priority, add to the next sprint), topic:eval
mrm1001 (Member) commented Mar 28, 2024

Goal:
Showcase the Haystack evaluation metrics on a dataset that is not a toy dataset (e.g. Wikipedia articles) but that reflects use cases closer to what our users are working on. The goal of this task is to find an existing "benchmark" dataset (already published somewhere) that comes from an industry use case (e.g. legal, manuals, corporate FAQs).

Note: We have the earnings call dataset, but we decided against it because we would prefer not to release a new dataset.

Edit 09/04: Could we try to find multiple candidate datasets, to be used in a tutorial and a separate blog article?

@mrm1001 added the P1 (High priority, add to the next sprint) and topic:eval labels on Mar 28, 2024
@masci added the P2 (Medium priority, add to the next sprint if no P1 available) label and removed the P1 label on Mar 28, 2024
@masci added the P1 label and removed the P2 label on Apr 7, 2024
davidsbatista (Contributor) commented

I've gathered a few annotated datasets that could be a fit for this.

mrm1001 (Member, Author) commented Apr 9, 2024

After catching up with @vblagoje:

  • pick a minimum of 2 datasets that look interesting
  • check their license to make sure we can host a processed version somewhere
  • do the processing needed to add them to a Haystack RAG pipeline easily (as documents; see the sketch after this list)
  • add the processed version of these datasets somewhere that is easy to download (for the tutorial)
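A minimal sketch of what step 3 could look like, assuming Haystack 2.x and the Hugging Face `datasets` library; the repo id and the column names ("context", "question", "answer") are placeholders that will differ per dataset:

```python
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Hypothetical repo id; the real one is wherever we host the processed version.
dataset = load_dataset("some-org/some-processed-dataset", split="train")

# Turn each row into a Haystack Document; column names are placeholders.
documents = [
    Document(content=row["context"], meta={"question": row["question"], "answer": row["answer"]})
    for row in dataset
]

# Index the documents so a RAG pipeline's retriever can use them.
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
```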

vblagoje (Member) commented Apr 12, 2024

@mrm1001 here are the two datasets:

  • PubMedQA_instruction

    • doesn't need any pre-processing and can be used directly
    • it has a permissive MIT license
  • Law-StackExchange

    • needs flattening pre-processing, after which it can be uploaded to our repo
    • it has a CC BY-SA 4.0 license (attribution + share-alike), which allows us to host a processed version
    • here is the notebook to flatten the dataset and upload it where needed, and here is the flat version (a rough sketch of the flattening step follows below)

I'll be on the lookout for more datasets so we can use them in tutorials/social posts.
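A rough sketch of the flattening step, not the exact notebook code: the source repo id, field names, and nesting below are all assumptions, and the linked notebook is the authoritative version.

```python
from datasets import Dataset, load_dataset

raw = load_dataset("ymoslem/Law-StackExchange", split="train")  # assumed source repo

flat_rows = []
for row in raw:
    for answer in row["answers"]:  # assumed: each row nests a list of answers
        flat_rows.append(
            {
                "title": row["question_title"],    # assumed field name
                "question": row["question_body"],  # assumed field name
                "answer": answer["body"],          # assumed field name
                "score": answer["score"],          # assumed field name
            }
        )

# One row per (question, answer) pair; upload to our repo (assumed destination id).
flat = Dataset.from_list(flat_rows)
flat.push_to_hub("vblagoje/Law-StackExchange")
```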

mrm1001 (Member, Author) commented Apr 12, 2024

Hi Vladimir, could we split each dataset in two:

  • one with the deduplicated set of documents, ready to be loaded into a RAG pipeline (a sketch of the split follows below)
  • the other with the set of question/context/answer triples (like now)
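A minimal sketch of the requested split, assuming a flat dataset with a "context" column holding the document text; the column name is a placeholder and differs per dataset:

```python
from datasets import load_dataset

qa = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Subset 1: deduplicated documents, ready to be indexed in a RAG pipeline.
df = qa.to_pandas()
documents_only = df.drop_duplicates(subset="context")[["context"]]

# Subset 2: the question/context/answer triples, kept as they are now.
triples = qa
```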

vblagoje (Member) commented
Here is the notebook that dedups PubMedQA_instruction, and here is the deduped dataset version.

Here is the flat, deduped Law-StackExchange.

LMK if there is anything else to be done

mrm1001 (Member, Author) commented Apr 24, 2024

Thoughts about how to use these datasets for evaluation:

Pubmed dataset

  • https://huggingface.co/datasets/vblagoje/PubMedQA_instruction
    • it has a question ("instruction"), a context ("paragraph with the right answer"), and a response ("right answer")
    • it is unique by question (but multiple questions might be answered by the same context)
    • How can this dataset be used to show evaluation?
      • use the contexts as source docs
      • use LLM-based evaluation + SAS to compare the output answer with the actual answer (a SAS sketch follows below)
      • use doc retrieval metrics to show retrieval quality (MAP, MRR, recall)
    • something to bear in mind: this dataset is very large; it might make sense to downsample it in the tutorial and select rows that have shorter responses (some of them can be very long).
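A minimal sketch of the SAS comparison, assuming Haystack 2.x evaluator components; the answer strings and the embedding model choice are placeholders, and predicted_answers would come from running the RAG pipeline over the downsampled slice:

```python
from haystack.components.evaluators import SASEvaluator

ground_truth_answers = ["Metformin is the first-line treatment for type 2 diabetes."]  # "response" column
predicted_answers = ["The first-line drug for type 2 diabetes is metformin."]          # pipeline output

# Semantic Answer Similarity: embeds both answers and scores their similarity.
sas = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
sas.warm_up()
result = sas.run(ground_truth_answers=ground_truth_answers, predicted_answers=predicted_answers)
print(result["score"], result["individual_scores"])
```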

Stack exchange dataset

  • https://huggingface.co/datasets/vblagoje/Law-StackExchange-Deduplicated
    • this one comes from a legal forum
    • the titles are usually questions
    • How can this dataset be used to show evaluation?
      • use the "answers" column as source docs/contexts and the titles as questions, so we know the best doc for each question. We could use this to showcase retriever metrics (MAP, MRR, recall); a sketch follows below.
      • Otherwise I would say we will need to manually create some questions (different from the titles), and then show how to use LLM-based metrics when there are no ground-truth answers.
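A minimal sketch of those retriever metrics, again assuming Haystack 2.x evaluator components; the documents below are placeholders, with each title's matching forum answer treated as the single ground-truth document:

```python
from haystack import Document
from haystack.components.evaluators import (
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
)

# One query (a title); its matching answer is the single ground-truth document.
ground_truth_documents = [[Document(content="The matching forum answer for this title.")]]
retrieved_documents = [[
    Document(content="The matching forum answer for this title."),
    Document(content="An unrelated answer to a different question."),
]]

# Each evaluator compares per-query retrieved docs against the ground truth.
for evaluator in (DocumentMAPEvaluator(), DocumentMRREvaluator(), DocumentRecallEvaluator()):
    result = evaluator.run(
        ground_truth_documents=ground_truth_documents,
        retrieved_documents=retrieved_documents,
    )
    print(type(evaluator).__name__, result["score"])
```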

mrm1001 (Member, Author) commented Apr 25, 2024

Also found by @vblagoje:

AllenAI extractive QA dataset

  • https://huggingface.co/datasets/allenai/ropes
  • How can this dataset be used to show evaluation?
    • we can use the "situation" and "question" together as a query (it would be a very long query); the background is then the doc you need to answer that query, and a final answer is provided. Even if it's extractive QA, I think you can still show off our new retriever metrics, LLM-based metrics, and also extractive QA metrics (recall). A quick sketch of the query construction follows below.
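A quick sketch of that query construction, assuming the dataset's published columns ("background", "situation", "question", "answers"); verify the schema against the dataset card before relying on it:

```python
from datasets import load_dataset

ropes = load_dataset("allenai/ropes", split="train")

row = ropes[0]
query = f'{row["situation"]} {row["question"]}'  # very long query: situation + question
source_doc = row["background"]                   # the doc needed to answer the query
gold_answer = row["answers"]["text"][0]          # extractive ground-truth answer
```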

Australian Legal QA

mrm1001 closed this as completed May 2, 2024