
[AGE-276] Improve robustness AI critique evaluation #1233

Open · 2 of 3 tasks
mmabrouk opened this issue Jan 19, 2024 · 6 comments
Labels: 1 points (Created by Linear-GitHub Sync), bug (Something isn't working), evaluation, High Priority, linear (Created by Linear-GitHub Sync)

Comments

@mmabrouk
Member

mmabrouk commented Jan 19, 2024

Problem Description:
There are cases where the AI critique evaluator returns a response that includes both a numerical rating and text, such as "8 - The output is accurate, well-written, and concise, but lacks detailed information." Currently, such a response fails the evaluation.
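
For illustration, here is a minimal sketch of a more tolerant score parser (a hypothetical helper, not the evaluator's current code) that would accept mixed responses like the one above by extracting the leading number:

```python
import re
from typing import Optional

def extract_score(response: str) -> Optional[float]:
    """Pull the first number out of an AI critique response.

    Accepts bare numbers ("8") as well as mixed responses such as
    "8 - The output is accurate, well-written, and concise."
    Returns None when no number is present.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(match.group()) if match else None

# extract_score("8 - The output is accurate, well-written, and concise")  -> 8.0
# extract_score("0.7")                                                    -> 0.7
# extract_score("The output is good")                                     -> None
```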

Proposed Solution:

AGE-276

@mmabrouk mmabrouk added bug Something isn't working evaluation labels Jan 19, 2024
@mmabrouk mmabrouk added the linear Created by Linear-GitHub Sync label May 28, 2024
@mmabrouk mmabrouk changed the title Improve robustness AI critique evaluation [AGE-276] Improve robustness AI critique evaluation May 28, 2024
@mmabrouk mmabrouk added this to the Evaluation view v2 milestone May 28, 2024
@mmabrouk mmabrouk modified the milestones: Evaluation view v2, v.52 Jul 3, 2024
@mmabrouk
Member Author

mmabrouk commented Jul 4, 2024

@jp it turns out this is a duplicate issue. I have added some comments (the ones mentioned this morning in AGE-370)

@mmabrouk mmabrouk modified the milestones: v.52, v.53 Jul 7, 2024
@mmabrouk mmabrouk added the 1 points Created by Linear-GitHub Sync label Jul 12, 2024
@matallanas

I was looking at the code, and the main issue is that gpt-3.5-turbo makes a lot of errors. The model used as evaluator could also be made configurable. In addition, gpt-4o-mini is now cheaper than gpt-3.5. The example prompt could also be made more precise so that it matches the required output.

@aakrem
Collaborator

aakrem commented Jul 26, 2024

I think a better solution would be to let the user choose the model when configuring the ai-critique evaluator.

@mmabrouk
Member Author

I agree that it makes sense to provide a model choice for the user together with a way to configure the evaluator (@jp-agenta it might make sense to think about how we can use the same logic for configuring RAG evaluators and AI Critique). However, we still need to provide good defaults.

A first solution is to lower the temperature. A second is to ask for JSON output.
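
As a rough sketch of what such defaults could look like (assuming the OpenAI Python client and a judge model that supports JSON mode; the model name, prompt, and JSON shape are placeholders, not the evaluator's actual implementation):

```python
import json
from openai import OpenAI

client = OpenAI()

def ai_critique_score(evaluation_prompt: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the judge for a JSON object so the score can be parsed reliably."""
    response = client.chat.completions.create(
        model=model,       # configurable per evaluator, as discussed above
        temperature=0.0,   # low temperature for more deterministic scoring
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": 'Reply only with a JSON object: {"score": <number>, "reason": "<short explanation>"}',
            },
            {"role": "user", "content": evaluation_prompt},
        ],
    )
    return float(json.loads(response.choices[0].message.content)["score"])
```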

@mmabrouk mmabrouk modified the milestones: v.55, v.56 Jul 26, 2024
@matallanas

To test my hypothesis I made a webhook with the LLM judge and gpt-3.5-turbo, with a temperature of 0.2 and the following prompt:

We have an LLM App that we want to evaluate its outputs. Based on the prompt and the parameters provided below evaluate the output based on the evaluation strategy below:
Evaluation strategy: 0 to 1, 0 is very bad and 1 is very good.
Expected Answer Column:{correct_answer}
Evaluate this: {output}

Answer ONLY with the floating number nothing else.

In 90% of the cases, I get back only the number as the answer.
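
A sketch of roughly how that test can be reproduced (the webhook plumbing is omitted; the direct OpenAI call below is an assumption standing in for it, with the model, temperature, and prompt taken from the comment above):

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """We have an LLM App that we want to evaluate its outputs. Based on the prompt and the parameters provided below evaluate the output based on the evaluation strategy below:
Evaluation strategy: 0 to 1, 0 is very bad and 1 is very good.
Expected Answer Column:{correct_answer}
Evaluate this: {output}

Answer ONLY with the floating number nothing else."""

def judge(correct_answer: str, output: str) -> float:
    """Score one output with the LLM judge and parse the bare number it returns."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(correct_answer=correct_answer, output=output),
            }
        ],
    )
    return float(response.choices[0].message.content.strip())
```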

@mmabrouk
Member Author

@matallanas I have created a PR based on this discussion: #1938
