
Using Language Models as Evaluators #1831

Open
lintangsutawika opened this issue May 13, 2024 · 6 comments
Labels
feature request A feature that isn't implemented yet.

Comments

@lintangsutawika
Contributor

lintangsutawika commented May 13, 2024

How can we integrate the use of language models to evaluate language model generations?

Currently, lm-eval evaluates language model generations with conventional metrics such as accuracy, BLEU, etc. These have proven shortcomings, such as frequently being poor proxies of progress or performance. Recent methods use GPT-4 to evaluate other language models' generations, and open-source approaches such as Prometheus have also gained interest.

Basic Requirements:

  1. Call lm-eval to directly evaluate previously-saved model generations.
  2. The language model "judge" would be loaded as a metric and output a score. It should support both API-based and HF models. API-based models will likely need separate class objects for each provider (OpenAI, Cohere, Google), while HF models can share a single implementation. All classes should inherit from a base class so that future contributors can easily propose their own (a rough sketch follows this list). The main methods of the base class should cover (1) scoring a single sample, (2) scoring a batch of samples, and (3) aggregating results. This should be compatible with how metrics are currently used and implemented; if required, we can also revisit how to refactor metrics in lm-eval.
  3. Should these judges also expose metric parameters that are configurable via a YAML task config?
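
As a concrete starting point, here is a minimal sketch of what the base class in requirement 2 could look like. Every name here (`JudgeModel`, `score_sample`, and so on) is a placeholder rather than an agreed-upon lm-eval API, and the constructor takes keyword arguments so that judge parameters could be passed through from a YAML task config (requirement 3).

```python
# Hypothetical interface only; none of these names exist in lm-eval today.
from abc import ABC, abstractmethod
from typing import List, Optional


class JudgeModel(ABC):
    """Base class that API-based judges (OpenAI, Cohere, Google) and HF judges would subclass."""

    def __init__(self, **judge_kwargs):
        # judge_kwargs would come from the task's YAML metric config (e.g. temperature, rubric).
        self.judge_kwargs = judge_kwargs

    @abstractmethod
    def score_sample(self, prompt: str, generation: str, reference: Optional[str] = None) -> float:
        """Score a single saved generation and return a numeric score."""
        ...

    def score_batch(self, prompts: List[str], generations: List[str]) -> List[float]:
        """Score a batch of generations; the naive default loops over score_sample."""
        return [self.score_sample(p, g) for p, g in zip(prompts, generations)]

    @staticmethod
    def aggregate(scores: List[float]) -> float:
        """Aggregate per-sample scores into a task-level number (mean by default)."""
        return sum(scores) / len(scores)
```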

Advanced/Optional Requirements:

  1. Since metric evaluation occurs after all generations are obtained, it should be possible to flush the GPUs and then load the language model judge onto the GPU for faster scoring (see the sketch below).
  2. Option to evaluate the judges themselves through means like inter-model agreement.
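
For the GPU hand-off in the first item, the rough idea is just to drop all references to the evaluated model, empty the CUDA cache, and then load the judge. The checkpoint names below are placeholders, and this assumes HF `transformers` models on both sides.

```python
# Sketch only: free the evaluated model's GPU memory, then load the judge in its place.
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for the model that was just evaluated (placeholder checkpoint name).
lm = AutoModelForCausalLM.from_pretrained("org/evaluated-model", device_map="auto")

# ... run all generation requests and save the outputs ...

# Flush the evaluated model off the GPU.
del lm                      # drop the last reference to its weights
gc.collect()                # let Python actually release them
torch.cuda.empty_cache()    # return cached CUDA blocks to the allocator

# Load the judge onto the freed GPU(s).
judge_name = "org/judge-model"  # placeholder; e.g. a Prometheus-style judge
judge = AutoModelForCausalLM.from_pretrained(judge_name, device_map="auto")
judge_tokenizer = AutoTokenizer.from_pretrained(judge_name)
```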
lintangsutawika added the feature request label on May 13, 2024
@lintangsutawika
Contributor Author

A multi-step call would probably be easier to implement.

@artemorloff
Contributor

@lintangsutawika why not separate multi-step/multi-round from llm-as-a-judge? Usually, multi-step requires the same model to generate on a prompt that includes its previous generation. That seems quite different, and there's no need for such a complicated structure to implement those tasks. Moreover, llm-as-a-judge seems to be a kind of metric. Multi-step tasks may require that the current prompt include all previous LM answers, which leads to calling the same LM as a judge multiple times.
Or did I get something wrong?
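
For illustration, a minimal sketch of the multi-round pattern described above, assuming nothing about lm-eval's interfaces (`generate` here stands in for whatever text-generation call the harness exposes):

```python
# Illustrative only: multi-round generation where each new prompt carries all prior answers.
from typing import Callable, List


def multi_step(generate: Callable[[str], str], question: str, n_rounds: int = 3) -> List[str]:
    """Run the same model for several rounds, feeding its earlier answers back into the prompt."""
    answers: List[str] = []
    for _ in range(n_rounds):
        prompt = question + "".join(f"\nPrevious answer: {a}" for a in answers)
        answers.append(generate(prompt))
    return answers
```

This is a different flow from llm-as-a-judge, where a (possibly different) model only scores generations that have already been produced and saved.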

@lintangsutawika
Contributor Author

LLM-as-a-judge can be considered a type of metric. The point is that the score typically produced by a heuristic is instead output by a language model, which may or may not be the same type/version as the model being evaluated. This isn't really related to the multi-step idea, but it is something that could be beneficial for evaluating multi-step tasks.
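
In code, that framing is roughly the following; this is a hedged sketch reusing the placeholder `JudgeModel` from the earlier sketch, not how lm-eval currently registers metrics.

```python
# Hypothetical adapter: expose a judge as a plain metric function over saved generations.
from typing import List


def judge_metric(judge: "JudgeModel", prompts: List[str], generations: List[str]) -> float:
    """Return a task-level score computed by a judge model instead of a heuristic."""
    per_sample = judge.score_batch(prompts, generations)  # one score per generation
    return judge.aggregate(per_sample)                    # e.g. the mean
```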

baberabb mentioned this issue Jun 11, 2024
@SeungoneKim

@lintangsutawika Is there currently any initiative for this feature? I would love to help.

@lintangsutawika
Contributor Author

Yes, currently @baberabb is working on a PR for this. I think it'd be great if you could provide some feedback based on your experience with Prometheus.

@SeungoneKim

@baberabb Hello Baber, nice to meet you! I'd love to collaborate with you on this! On which platform could I best communicate with you (e.g., Slack, Discord)?
