[Discussion/Feedback] VLM + Multimodal benchmarking #1155

Open · haileyschoelkopf opened this issue Dec 18, 2023 · 3 comments
Labels: opinions wanted (For discussing open questions.)

@haileyschoelkopf (Collaborator)

This is an issue to discuss the feasibility and desirability of including and supporting multimodal benchmarks in lm-eval. With the rise of multimodal (V)LMs and of benchmarks like MMMU designed to test them, it's worth discussing whether lm-eval is easily extensible to these tests, and whether we want them in scope for the library at all.

haileyschoelkopf added the opinions wanted label on Dec 18, 2023
LSinev mentioned this issue on Mar 24, 2024
@ashvinnihalani

There is already a fork of lm-evaluation-harness built by the LLaVA team, called lmms-eval, that is focused on VLM evaluation and can serve as a PoC. For what it's worth, I already have a private fork of lm-eval that adds LLaVA support and works with MMMU and the LLaVA 1.5 7B model through the original codebase. I think the main considerations for an implementation are the following:

  • How tightly do we want to couple VLMs and MM benchmarks? Should VLMs only be able to run MM benchmarks, or should we enable both LMs and VLMs to run MM benchmarks? Do we need to ensure a VLM can also run text-only benchmarks?
  • Are VLMs first-class citizens? Do we need to focus on optimizations for them? For example, vLLM introduced VLM support just a couple of days ago. Do we need to pull in the latest benchmarks?
  • Significant effort needs to be invested in optimizing VLM benchmarks, as image data, and video data in particular, is very large; at what point does that become a problem?

If the project owners are aligned on these questions, I can clean up and submit a PR for my LLaVA implementation.
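As a rough sketch of the shape such a backend could take against lm-eval's v0.4 model API (`register_model`, the `LM` base class, `generate_until`): the LLaVA specifics below (the checkpoint name, reading an `image` field off the task doc, leaving loglikelihood unimplemented) are illustrative assumptions, not the exact code in my fork.

```python
# Hypothetical sketch only: the lm-eval registry/base-class API is real,
# but the LLaVA wiring below is an assumption about how a backend could work.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("llava")
class LLaVaLM(LM):
    def __init__(self, pretrained="llava-hf/llava-1.5-7b-hf", device="cuda", **kwargs):
        super().__init__()
        from transformers import AutoProcessor, LlavaForConditionalGeneration

        self.processor = AutoProcessor.from_pretrained(pretrained)
        self.model = LlavaForConditionalGeneration.from_pretrained(pretrained).to(device)
        self.device = device

    def generate_until(self, requests):
        # Each Instance carries (context, gen_kwargs) in .args; the image
        # would have to come from the task doc, which is exactly the
        # coupling question raised in the list above.
        results = []
        for req in requests:
            context, gen_kwargs = req.args
            image = req.doc.get("image")  # assumes the task exposes an image field
            inputs = self.processor(
                text=context, images=image, return_tensors="pt"
            ).to(self.device)
            out = self.model.generate(
                **inputs, max_new_tokens=gen_kwargs.get("max_gen_toks", 256)
            )
            results.append(self.processor.decode(out[0], skip_special_tokens=True))
        return results

    def loglikelihood(self, requests):
        # Multiple-choice scoring (e.g. MMMU) would need image-conditioned
        # loglikelihoods; omitted in this sketch.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError
```

Hypothetically this would then run as `lm_eval --model llava --tasks mmmu` (assuming an `mmmu` task config exists), which is where the coupling questions above surface in practice: the image has to travel from the task doc to the model somehow.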

@ashvinnihalani

Wanted to follow up on this.

@ashvinnihalani

Just wanted to give a heads-up that I've started the PR.
