Using Language Models as Evaluators #1831
Comments
A multi-step call would probably be easier to implement.
@lintangsutawika why not separate multi-step/multi-round from llm-as-a-judge? Multi-step usually requires the same model to generate on a prompt that includes its previous generation. That seems quite different, and there's no need for such a complicated structure to implement these tasks. Moreover, llm-as-a-judge seems to be a kind of metric. Multi-step tasks may require that the current prompt include all previous LM answers, which leads to calling the same LM as a judge multiple times.
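The multi-step flow described above (each step's prompt includes the previous generations) can be sketched as follows; `model` here is a generic hypothetical callable standing in for whatever generation interface lm-eval exposes, not its actual API:

```python
def multi_step_generate(model, initial_prompt, n_steps):
    """Run the same model for several rounds, feeding each round's
    output back into the next prompt (illustrative sketch only)."""
    history = [initial_prompt]
    outputs = []
    for _ in range(n_steps):
        # The current prompt includes all previous LM answers.
        prompt = "\n".join(history)
        out = model(prompt)
        outputs.append(out)
        history.append(out)
    return outputs
```

With a stub model that just reports how many lines it was given, two steps produce `["1", "2"]`, showing that each call sees a prompt grown by one prior answer.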
Llm-as-a-judge can be considered a type of metric: the score typically produced by a heuristic is instead output by a language model, which may or may not be the same type/version as the model being evaluated. This isn't really related to the multi-step idea, but it is something that could be beneficial for evaluating multi-step tasks.
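A minimal sketch of llm-as-a-judge as a metric, under the assumptions above: `judge_model` is a hypothetical callable (any model, not necessarily the one being evaluated), and the prompt template and 1–5 scale are illustrative choices, not lm-eval's actual interface:

```python
def llm_judge_metric(judge_model, question, reference, generation):
    """Score a generation by asking a judge model for a 1-5 rating
    (illustrative sketch; not lm-eval's real metric API)."""
    judge_prompt = (
        "Rate the response on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Response: {generation}\n"
        "Rating:"
    )
    raw = judge_model(judge_prompt)
    try:
        # Take the first whitespace-delimited token as the score.
        return float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # judge output was unparseable
```

Because the judge is just a callable, the same function works whether the judge is GPT-4, Prometheus, or the evaluated model itself; parsing failures fall back to 0.0 rather than crashing the run.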
@lintangsutawika Is there currently any initiative for this feature? I would love to help |
Yes, currently @baberabb is working on a PR for this. I think it'd be great if you could provide some feedback based on your experience with Prometheus.
@baberabb Hello Baber, nice to meet you! I'd love to collaborate with you on working on this! On which platform could I best communicate with you (e.g., slack, discord)? |
How can we integrate the use of language models to evaluate language model generations?
Currently, lm-eval evaluates language model generations with conventional metrics such as accuracy, BLEU, etc. These have proven shortcomings, such as frequently being poor proxies of progress or performance. Recent methods use GPT-4 to evaluate other language models' generations, and open-source approaches such as Prometheus have also gained interest.
Basic Requirements:
Advanced/Optional Requirements: