
Evaluating LLMs on QA Tasks #65

Closed

slavakurilyak opened this issue Mar 14, 2023 · 7 comments


slavakurilyak commented Mar 14, 2023

Here's an idea on how to evaluate an LLM on various question-answering tasks, such as open-domain question answering, conversational question answering, answer selection, community question answering, and knowledge base question answering:

initialize model
initialize datasets
initialize evaluation_metrics

load_task_data:
    for each task in tasks:
        load data for task
        preprocess data if necessary (e.g., combine review summary and text)
        store data in datasets

embed_task_data:
    for each task in tasks:
        for each example in datasets[task]:
            obtain prompt from example
            obtain prompt_embedding using an embedding function
            store prompt_embedding in example

evaluate_model_on_task:
    for each task in tasks:
        for each example in datasets[task]:
            obtain prompt_embedding from example
            answer_embedding = model.generate(prompt_embedding)

            metric_result = evaluation_metrics(example, answer_embedding)
            store metric_result in results for task

aggregate_and_report_metrics:
    for each task in tasks:
        for each metric in evaluation_metrics:
            calculate average, median, or other aggregate metric values
            report metric value for task

main:
    load_task_data
    embed_task_data
    evaluate_model_on_task
    aggregate_and_report_metrics

I'd like to add a few caveats about the pseudocode I provided:

  • The provided pseudocode is only a starting point for exploring the evaluation of QA tasks using embeddings
  • This pseudocode is not complete
  • I invite the community to provide input
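
To make the loop above a bit more concrete, here is a minimal, untested Python sketch. The embed, generate, and score callables and the dataset layout (a dict mapping each task name to a list of examples with "prompt" and "reference" fields) are placeholders made up for illustration, and unlike the pseudocode the model generates from the prompt text rather than from its embedding, which is closer to how most LLM APIs behave:

from statistics import mean, median
from typing import Callable

def evaluate(
    tasks: dict[str, list[dict]],          # task name -> list of examples
    embed: Callable[[str], list[float]],   # prompt text -> embedding vector
    generate: Callable[[str], str],        # prompt text -> model answer
    score: Callable[[str, str], float],    # (answer, reference) -> metric value
) -> dict[str, dict[str, float]]:
    results: dict[str, list[float]] = {task: [] for task in tasks}

    for task, examples in tasks.items():
        for example in examples:
            prompt = example["prompt"]
            # Keep the embedding alongside the example, e.g. for later
            # retrieval or nearest-neighbour analysis.
            example["prompt_embedding"] = embed(prompt)

            answer = generate(prompt)
            results[task].append(score(answer, example["reference"]))

    # Aggregate and report per-task metric values.
    return {
        task: {"mean": mean(scores), "median": median(scores)}
        for task, scores in results.items()
        if scores
    }

Each of the QA task types listed above would simply be another key in tasks, with its own examples and, if needed, its own score function.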
@placcaumuhire

Hey there! Thanks for sharing your idea on how to evaluate an LLM on various question-answering tasks. I really appreciate your contribution and I think your pseudocode provides a great starting point for exploring and understanding the evaluation process. And you're right, there's always room for improvement, so I encourage you and others to share your thoughts and experiences to help enhance the understanding and implementation of this process. Keep up the good work!

@Abhishekagrawal1404

quite impressive


ricky-sb commented Mar 15, 2023

@slavakurilyak ok I know this is a weird question, but...did you generate this with ChatGPT? 👀

It has a very similar tone. The pseudocode, the disclaimers, the step-by-step thing. It's very similar to when I ask ChatGPT for coding help.


Abhishekagrawal1404 commented Mar 15, 2023 via email

@YoshiDeSchrijver

I'd like to contribute.

@placcaumuhire

🤓 Me ✌️ from 🇷🇼

@andrew-openai
Contributor

The tasks described:

question-answering tasks, such as open-domain question answering, conversational question answering, answer selection, community question answering, and knowledge base question answering

should already be supported by Evals, since you can make the input a Chat conversation object up until the next turn (which is when the model would respond).
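
For example, assuming the usual "input"/"ideal" JSONL sample layout, a conversational QA sample could be built roughly like this (the conversation content is invented for illustration):

# Hypothetical conversational-QA sample; the "input" list holds the chat
# messages up to the turn where the model should respond, and "ideal" holds
# the expected answer.
import json

sample = {
    "input": [
        {"role": "system", "content": "Answer the question concisely."},
        {"role": "user", "content": "In what year did Apollo 11 land on the Moon?"},
        {"role": "assistant", "content": "1969."},
        {"role": "user", "content": "Who was the mission commander?"},
    ],
    "ideal": "Neil Armstrong",
}
print(json.dumps(sample))  # one line per sample in the .jsonl file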
