
find an industry dataset to showcase evaluation metrics #7438

Closed · Tracked by #7407
mrm1001 opened this issue Mar 28, 2024 · 7 comments

Labels: P1 (High priority, add to the next sprint), topic:eval
mrm1001 (Member) commented Mar 28, 2024

Goal:
Showcase the Haystack evaluation metrics on a dataset that is not a toy dataset (e.g. Wikipedia articles) but that reflects use cases closer to what our users are working on. The goal of this task is to find an existing "benchmark" dataset (already published somewhere) that comes from an industry use case (e.g. legal, manuals, corporate FAQs).

Note: We have the earnings call dataset, but we decided against it because we would prefer not to release a new dataset.

Edit 09/04: Could we try to find multiple candidate datasets, to be used in a tutorial and a separate blog article?

@mrm1001 added the P1 (High priority, add to the next sprint) and topic:eval labels on Mar 28, 2024
@masci added the P2 (Medium priority, add to the next sprint if no P1 available) label and removed the P1 label on Mar 28, 2024
@masci added the P1 label and removed the P2 label on Apr 7, 2024
davidsbatista (Contributor) commented

I've gathered a few annotated datasets that could be a fit for this.

mrm1001 (Member, Author) commented Apr 9, 2024

After catching up with @vblagoje:

  • pick a minimum of 2 datasets that look interesting
  • check their license to make sure we can host a processed version somewhere
  • do the processing needed to add them to a Haystack RAG pipeline easily (as documents; see the sketch after this list)
  • add the processed version of these datasets somewhere that is easy to download (for the tutorial)
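A minimal sketch of what step 3 could look like, assuming Haystack 2.x and the Hugging Face `datasets` library; the repo id and the column names ("context", "question", "answer") are placeholders that will differ per dataset:

```python
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Hypothetical repo id; the real one is wherever we host the processed version.
dataset = load_dataset("some-org/some-processed-dataset", split="train")

# Turn each row into a Haystack Document; column names are placeholders.
documents = [
    Document(content=row["context"], meta={"question": row["question"], "answer": row["answer"]})
    for row in dataset
]

# Index the documents so a RAG pipeline's retriever can use them.
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
```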

vblagoje (Member) commented Apr 12, 2024

@mrm1001 here are the two datasets:

  • PubMedQA_instruction

    • doesn't need any pre-processing and can be used directly
    • it has a permissive MIT license
  • Law-StackExchange

    • needs flattening pre-processing, after which it can be uploaded to our repo
    • it has a CC BY-SA 4.0 license (attribution + share-alike), which allows us to host a processed version
    • here is the notebook to flatten the dataset and upload it where needed, and here is the flat version (a rough sketch of the flattening step follows below)

I'll be on the lookout for more datasets so we can use them in tutorials/social posts.
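A rough sketch of the flattening step, not the exact notebook code: the source repo id, field names, and nesting below are all assumptions, and the linked notebook is the authoritative version.

```python
from datasets import Dataset, load_dataset

raw = load_dataset("ymoslem/Law-StackExchange", split="train")  # assumed source repo

flat_rows = []
for row in raw:
    for answer in row["answers"]:  # assumed: each row nests a list of answers
        flat_rows.append(
            {
                "title": row["question_title"],    # assumed field name
                "question": row["question_body"],  # assumed field name
                "answer": answer["body"],          # assumed field name
                "score": answer["score"],          # assumed field name
            }
        )

# One row per (question, answer) pair; upload to our repo (assumed destination id).
flat = Dataset.from_list(flat_rows)
flat.push_to_hub("vblagoje/Law-StackExchange")
```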

mrm1001 (Member, Author) commented Apr 12, 2024

Hi Vladimir, could we split each dataset in two:

  • one with the deduplicated set of documents, ready to be loaded into a RAG pipeline (a sketch of the split follows below)
  • the other with the set of question/context/answer triples (like now)
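A minimal sketch of the requested split, assuming a flat dataset with a "context" column holding the document text; the column name is a placeholder and differs per dataset:

```python
from datasets import load_dataset

qa = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Subset 1: deduplicated documents, ready to be indexed in a RAG pipeline.
df = qa.to_pandas()
documents_only = df.drop_duplicates(subset="context")[["context"]]

# Subset 2: the question/context/answer triples, kept as they are now.
triples = qa
```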

vblagoje (Member) commented
Here is the notebook that dedups PubMedQA_instruction, and here is the deduped dataset version.

Here is the flat, deduped Law-StackExchange.

LMK if there is anything else to be done

mrm1001 (Member, Author) commented Apr 24, 2024

Thoughts about how to use these datasets for evaluation:

Pubmed dataset

  • https://huggingface.co/datasets/vblagoje/PubMedQA_instruction
    • it has a question ("instruction"), a context ("paragraph with the right answer"), and a response ("right answer")
    • it is unique by question (but multiple questions might be answered by the same context)
    • How can this dataset be used to show evaluation?
      • use the contexts as source docs
      • use LLM-based evaluation + SAS to compare the output answer with the actual answer (a SAS sketch follows below)
      • use doc retrieval metrics to show retrieval quality (MAP, MRR, recall)
    • something to bear in mind: this dataset is very large; it might make sense to downsample it in the tutorial and select rows that have shorter responses (some of them can be very long).
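A minimal sketch of the SAS comparison, assuming Haystack 2.x evaluator components; the answer strings and the embedding model choice are placeholders, and predicted_answers would come from running the RAG pipeline over the downsampled slice:

```python
from haystack.components.evaluators import SASEvaluator

ground_truth_answers = ["Metformin is the first-line treatment for type 2 diabetes."]  # "response" column
predicted_answers = ["The first-line drug for type 2 diabetes is metformin."]          # pipeline output

# Semantic Answer Similarity: embeds both answers and scores their similarity.
sas = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
sas.warm_up()
result = sas.run(ground_truth_answers=ground_truth_answers, predicted_answers=predicted_answers)
print(result["score"], result["individual_scores"])
```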

Stack exchange dataset

  • https://huggingface.co/datasets/vblagoje/Law-StackExchange-Deduplicated
    • this one comes from a legal forum
    • the titles are usually questions
    • How can this dataset be used to show evaluation?
      • use the "answers" column as source docs/contexts and the titles as questions, so we know the best doc for each question. We could use this to showcase retriever metrics (MAP, MRR, recall); a sketch follows below.
      • Otherwise I would say we will need to manually create some questions (different from the titles), and then show how to use LLM-based metrics when there are no ground-truth answers.
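A minimal sketch of those retriever metrics, again assuming Haystack 2.x evaluator components; the documents below are placeholders, with each title's matching forum answer treated as the single ground-truth document:

```python
from haystack import Document
from haystack.components.evaluators import (
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
)

# One query (a title); its matching answer is the single ground-truth document.
ground_truth_documents = [[Document(content="The matching forum answer for this title.")]]
retrieved_documents = [[
    Document(content="The matching forum answer for this title."),
    Document(content="An unrelated answer to a different question."),
]]

# Each evaluator compares per-query retrieved docs against the ground truth.
for evaluator in (DocumentMAPEvaluator(), DocumentMRREvaluator(), DocumentRecallEvaluator()):
    result = evaluator.run(
        ground_truth_documents=ground_truth_documents,
        retrieved_documents=retrieved_documents,
    )
    print(type(evaluator).__name__, result["score"])
```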

mrm1001 (Member, Author) commented Apr 25, 2024

Also found by @vblagoje:

AllenAI extractive QA dataset

  • https://huggingface.co/datasets/allenai/ropes
  • How can this dataset be used to show evaluation?
    • we can use the "situation" and "question" together as a query (it would be a very long query); the background is then the doc you need to answer that query, and a final answer is provided. Even if it's extractive QA, I think you can still show off our new retriever metrics, LLM-based metrics, and also extractive QA metrics (recall). A quick sketch of the query construction follows below.
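A quick sketch of that query construction, assuming the dataset's published columns ("background", "situation", "question", "answers"); verify the schema against the dataset card before relying on it:

```python
from datasets import load_dataset

ropes = load_dataset("allenai/ropes", split="train")

row = ropes[0]
query = f'{row["situation"]} {row["question"]}'  # very long query: situation + question
source_doc = row["background"]                   # the doc needed to answer the query
gold_answer = row["answers"]["text"][0]          # extractive ground-truth answer
```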

Australian Legal QA

mrm1001 closed this as completed May 2, 2024