[AGE-276] Improve robustness AI critique evaluation #1233
Comments
I was looking at the code, and the problem is mainly that gpt-3.5-turbo makes a lot of errors. The model used as evaluator could also be made configurable. In addition, gpt-4o-mini is now cheaper than gpt-3.5. The example prompt could also be made more precise so it matches the required output.
I think a better solution would be to add the model to be used while configuring the ai-critique evaluator.
I agree that it makes sense to give the user a model choice together with a way to configure the evaluator (@jp-agenta it might make sense to think about how we can use the same logic for configuring RAG evaluators and AI Critique). However, we still need to provide good defaults. The first solution is to lower the temperature. The second is to ask for JSON.
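Both defaults could be combined in one place. Below is a minimal sketch of a configurable judge call with a low temperature and JSON output, assuming the OpenAI Python client (>=1.0); the function name and the `{"score": ...}` schema are illustrative assumptions, not agenta's actual evaluator API.

```python
# Sketch: configurable judge model with conservative defaults.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def critique_score(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Ask the judge model for a rating and parse it from JSON."""
    response = client.chat.completions.create(
        model=model,  # configurable, rather than hard-coded gpt-3.5-turbo
        temperature=0.2,  # low temperature for more deterministic judging
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[
            {
                "role": "system",
                "content": 'Reply only with JSON of the form {"score": <0-10>}.',
            },
            {"role": "user", "content": prompt},
        ],
    )
    return float(json.loads(response.choices[0].message.content)["score"])
```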
To test my hypothesis I made a webhook with the LLM judge and gpt-3.5-turbo. With a temperature of 0.2 and the following prompt, 90% of the cases return only the number as the answer.
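For reference, a webhook evaluator along these lines might look like the sketch below. The FastAPI handler, the payload field names, and the prompt text are all illustrative assumptions (the original prompt is not reproduced in this thread, and this is not agenta's exact webhook contract).

```python
# Illustrative webhook evaluator: asks gpt-3.5-turbo for a bare numeric
# rating at temperature 0.2 and returns it as a score.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

# Example judge prompt, constraining the model to a number-only reply.
JUDGE_PROMPT = (
    "Rate the following answer from 0 to 10. "
    "Respond with the number only, no explanation.\n\nAnswer: {output}"
)


class EvaluationPayload(BaseModel):
    output: str  # field name is an assumption, not agenta's exact schema


@app.post("/evaluate")
def evaluate(payload: EvaluationPayload) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(output=payload.output)}
        ],
    )
    return {"score": float(response.choices[0].message.content.strip())}
```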
@matallanas I have created a PR based on this discussion: #1938
Problem Description:
There are cases where the AI critique evaluator returns a response that includes both a numerical rating and a text explanation, such as "8 - The output is accurate, well-written, and concise, but lacks detailed information." Currently this fails the evaluation.
Proposed Solution:
AGE-276
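One possible mitigation, sketched below, is to extract the leading number from the judge's reply instead of requiring a bare numeral. This is an illustration of the idea, not necessarily what the linked PR implements.

```python
# Sketch: tolerant parsing of a judge response that mixes a rating with
# free text, e.g. "8 - The output is accurate, well-written, and concise".
import re


def extract_rating(response: str) -> float | None:
    """Return the first number found in the response, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(match.group()) if match else None


assert extract_rating("8 - The output is accurate and concise.") == 8.0
assert extract_rating("Score: 7.5") == 7.5
assert extract_rating("no number here") is None
```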