
[AGE-276] Improve robustness AI critique evaluation #1233

Open · 2 of 3 tasks
mmabrouk opened this issue Jan 19, 2024 · 6 comments
Labels: 1 points (Created by Linear-GitHub Sync), bug (Something isn't working), evaluation, High Priority, linear (Created by Linear-GitHub Sync)

Comments

@mmabrouk
Member

mmabrouk commented Jan 19, 2024

Problem Description:
There are cases where the AI critique evaluator returns a response that includes both a numerical rating and text, such as "8 - The output is accurate, well-written, and concise, but lacks detailed information." Currently, such a response fails the evaluation.
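
For illustration, here is a minimal sketch of a more tolerant score parser (a hypothetical helper, not the evaluator's current code) that would accept mixed responses like the one above by extracting the leading number:

```python
import re
from typing import Optional

def extract_score(response: str) -> Optional[float]:
    """Pull the first number out of an AI critique response.

    Accepts bare numbers ("8") as well as mixed responses such as
    "8 - The output is accurate, well-written, and concise."
    Returns None when no number is present.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(match.group()) if match else None

# extract_score("8 - The output is accurate, well-written, and concise")  -> 8.0
# extract_score("0.7")                                                    -> 0.7
# extract_score("The output is good")                                     -> None
```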

Proposed Solution:

AGE-276

@mmabrouk mmabrouk added bug Something isn't working evaluation labels Jan 19, 2024
@mmabrouk mmabrouk added the linear Created by Linear-GitHub Sync label May 28, 2024
@mmabrouk mmabrouk changed the title Improve robustness AI critique evaluation [AGE-276] Improve robustness AI critique evaluation May 28, 2024
@mmabrouk mmabrouk added this to the Evaluation view v2 milestone May 28, 2024
@mmabrouk mmabrouk modified the milestones: Evaluation view v2, v.52 Jul 3, 2024
@mmabrouk
Member Author

mmabrouk commented Jul 4, 2024

@jp it turns out this is a duplicate issue. I have added some comments (the ones mentioned this morning in AGE-370)

@mmabrouk mmabrouk modified the milestones: v.52, v.53 Jul 7, 2024
@mmabrouk mmabrouk added the 1 points Created by Linear-GitHub Sync label Jul 12, 2024
@matallanas

I was looking at the code, and the main issue is that gpt-3.5-turbo makes a lot of errors. The model used as evaluator could also be made configurable. In addition, gpt-4o-mini is now cheaper than gpt-3.5. The example prompt could also be made more precise so that it matches the required output.

@aakrem
Collaborator

aakrem commented Jul 26, 2024

I think a better solution would be to let the user choose the model when configuring the ai-critique evaluator.

@mmabrouk
Member Author

I agree that it makes sense to provide a model choice for the user together with a way to configure the evaluator (@jp-agenta it might make sense to think about how we can use the same logic for configuring RAG evaluators and AI Critique). However, we still need to provide good defaults.

A first solution is to lower the temperature. A second is to ask for JSON output.
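
As a rough sketch of what such defaults could look like (assuming the OpenAI Python client and a judge model that supports JSON mode; the model name, prompt, and JSON shape are placeholders, not the evaluator's actual implementation):

```python
import json
from openai import OpenAI

client = OpenAI()

def ai_critique_score(evaluation_prompt: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the judge for a JSON object so the score can be parsed reliably."""
    response = client.chat.completions.create(
        model=model,       # configurable per evaluator, as discussed above
        temperature=0.0,   # low temperature for more deterministic scoring
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": 'Reply only with a JSON object: {"score": <number>, "reason": "<short explanation>"}',
            },
            {"role": "user", "content": evaluation_prompt},
        ],
    )
    return float(json.loads(response.choices[0].message.content)["score"])
```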

@mmabrouk mmabrouk modified the milestones: v.55, v.56 Jul 26, 2024
@matallanas

To test my hypothesis I made a webhook with the LLM judge and gpt-3.5-turbo, with a temperature of 0.2 and the following prompt:

We have an LLM App that we want to evaluate its outputs. Based on the prompt and the parameters provided below evaluate the output based on the evaluation strategy below:
Evaluation strategy: 0 to 1, 0 is very bad and 1 is very good.
Expected Answer Column:{correct_answer}
Evaluate this: {output}

Answer ONLY with the floating number nothing else.

In 90% of the cases, I get back only the number as the answer.
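
A sketch of roughly how that test can be reproduced (the webhook plumbing is omitted; the direct OpenAI call below is an assumption standing in for it, with the model, temperature, and prompt taken from the comment above):

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """We have an LLM App that we want to evaluate its outputs. Based on the prompt and the parameters provided below evaluate the output based on the evaluation strategy below:
Evaluation strategy: 0 to 1, 0 is very bad and 1 is very good.
Expected Answer Column:{correct_answer}
Evaluate this: {output}

Answer ONLY with the floating number nothing else."""

def judge(correct_answer: str, output: str) -> float:
    """Score one output with the LLM judge and parse the bare number it returns."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(correct_answer=correct_answer, output=output),
            }
        ],
    )
    return float(response.choices[0].message.content.strip())
```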

@mmabrouk
Member Author

@matallanas I have created a PR based on this discussion: #1938
