
Using Language Models as Evaluators #1831

Open
lintangsutawika opened this issue May 13, 2024 · 6 comments
Labels
feature request A feature that isn't implemented yet.

Comments

@lintangsutawika
Contributor

lintangsutawika commented May 13, 2024

How can we integrate the use of language models to evaluate language model generations?

Currently, lm-eval evaluates language model generations with conventional metrics such as accuracy, BLEU, etc. These have proven shortcomings, such as frequently being poor proxies of progress or performance. Recent methods use GPT-4 to evaluate other language models' generations, and open-source approaches such as Prometheus have also gained interest.

Basic Requirements:

  1. Call lm-eval to directly evaluate previously-saved model generations.
  2. The language model "judge" would be loaded as a metric and output a score. It should support both API-based and HF models. API-based models will likely need separate class objects for each provider (OpenAI, Cohere, Google), while HF models can share a single implementation. All classes should inherit from a base class so that future contributors can easily propose their own (a rough sketch follows this list). The main methods of the base class should cover (1) scoring a single sample, (2) scoring a batch of samples, and (3) aggregating results. This should be compatible with how metrics are currently used and implemented; if required, we can also revisit how to refactor metrics in lm-eval.
  3. Should these judges also expose metric parameters that are configurable via a YAML task config?
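
As a concrete starting point, here is a minimal sketch of what the base class in requirement 2 could look like. Every name here (`JudgeModel`, `score_sample`, and so on) is a placeholder rather than an agreed-upon lm-eval API, and the constructor takes keyword arguments so that judge parameters could be passed through from a YAML task config (requirement 3).

```python
# Hypothetical interface only; none of these names exist in lm-eval today.
from abc import ABC, abstractmethod
from typing import List, Optional


class JudgeModel(ABC):
    """Base class that API-based judges (OpenAI, Cohere, Google) and HF judges would subclass."""

    def __init__(self, **judge_kwargs):
        # judge_kwargs would come from the task's YAML metric config (e.g. temperature, rubric).
        self.judge_kwargs = judge_kwargs

    @abstractmethod
    def score_sample(self, prompt: str, generation: str, reference: Optional[str] = None) -> float:
        """Score a single saved generation and return a numeric score."""
        ...

    def score_batch(self, prompts: List[str], generations: List[str]) -> List[float]:
        """Score a batch of generations; the naive default loops over score_sample."""
        return [self.score_sample(p, g) for p, g in zip(prompts, generations)]

    @staticmethod
    def aggregate(scores: List[float]) -> float:
        """Aggregate per-sample scores into a task-level number (mean by default)."""
        return sum(scores) / len(scores)
```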

Advanced/Optional Requirements:

  1. Since metric evaluation occurs after all generations are obtained, it should be possible to flush the GPUs and then load the language model judge onto the GPU for faster scoring (see the sketch below).
  2. Option to evaluate the judges themselves through means like inter-model agreement.
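
For the GPU hand-off in the first item, the rough idea is just to drop all references to the evaluated model, empty the CUDA cache, and then load the judge. The checkpoint names below are placeholders, and this assumes HF `transformers` models on both sides.

```python
# Sketch only: free the evaluated model's GPU memory, then load the judge in its place.
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for the model that was just evaluated (placeholder checkpoint name).
lm = AutoModelForCausalLM.from_pretrained("org/evaluated-model", device_map="auto")

# ... run all generation requests and save the outputs ...

# Flush the evaluated model off the GPU.
del lm                      # drop the last reference to its weights
gc.collect()                # let Python actually release them
torch.cuda.empty_cache()    # return cached CUDA blocks to the allocator

# Load the judge onto the freed GPU(s).
judge_name = "org/judge-model"  # placeholder; e.g. a Prometheus-style judge
judge = AutoModelForCausalLM.from_pretrained(judge_name, device_map="auto")
judge_tokenizer = AutoTokenizer.from_pretrained(judge_name)
```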
lintangsutawika added the feature request label on May 13, 2024
@lintangsutawika
Contributor Author

A multi-step call would probably be easier to implement.

@artemorloff
Contributor

@lintangsutawika why not separate multi-step/multi-round from llm-as-a-judge? Usually, multi-step requires the same model to generate on a prompt that includes its previous generation. That seems quite different, and there's no need for such a complicated structure to implement those tasks. Moreover, llm-as-a-judge seems to be a kind of metric. Multi-step tasks may require that the current prompt include all previous LM answers, which leads to calling the same LM as a judge multiple times.
Or did I get something wrong?
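
For illustration, a minimal sketch of the multi-round pattern described above, assuming nothing about lm-eval's interfaces (`generate` here stands in for whatever text-generation call the harness exposes):

```python
# Illustrative only: multi-round generation where each new prompt carries all prior answers.
from typing import Callable, List


def multi_step(generate: Callable[[str], str], question: str, n_rounds: int = 3) -> List[str]:
    """Run the same model for several rounds, feeding its earlier answers back into the prompt."""
    answers: List[str] = []
    for _ in range(n_rounds):
        prompt = question + "".join(f"\nPrevious answer: {a}" for a in answers)
        answers.append(generate(prompt))
    return answers
```

This is a different flow from llm-as-a-judge, where a (possibly different) model only scores generations that have already been produced and saved.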

@lintangsutawika
Contributor Author

LLM-as-a-judge can be considered a type of metric. The point is that the score typically produced by a heuristic is instead output by a language model, which may or may not be the same type/version as the model being evaluated. This isn't really related to the multi-step idea, but it is something that could be beneficial for evaluating multi-step tasks.
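
In code, that framing is roughly the following; this is a hedged sketch reusing the placeholder `JudgeModel` from the earlier sketch, not how lm-eval currently registers metrics.

```python
# Hypothetical adapter: expose a judge as a plain metric function over saved generations.
from typing import List


def judge_metric(judge: "JudgeModel", prompts: List[str], generations: List[str]) -> float:
    """Return a task-level score computed by a judge model instead of a heuristic."""
    per_sample = judge.score_batch(prompts, generations)  # one score per generation
    return judge.aggregate(per_sample)                    # e.g. the mean
```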

baberabb mentioned this issue Jun 11, 2024
@SeungoneKim

@lintangsutawika Is there currently any initiative for this feature? I would love to help.

@lintangsutawika
Contributor Author

Yes, currently @baberabb is working on a PR for this. I think it'd be great if you could provide some feedback based on your experience with Prometheus.

@SeungoneKim

@baberabb Hello Baber, nice to meet you! I'd love to collaborate with you on this! On which platform could I best communicate with you (e.g., Slack, Discord)?
