
Question about LlavaBench Evaluation #194

Closed
justinphan3110 opened this issue May 9, 2024 · 4 comments

Comments

@justinphan3110

Hi author,

I saw that there are 3 columns for LLaVABench scores (Relative Score, VLM Score, GPT4 Score), and it seems the evaluation code calculates the Relative Score from the VLM Score and the GPT4 Score. However, the original LLaVA project only seems to report the GPT4 Score (this table). So how are the VLM Score and Relative Score calculated and reported? Or am I missing something?

@kennymckormick
Member

Hi @justinphan3110,

TL;DR: The GPT4 score reported by the original LLaVA project corresponds to the Relative Score reported in VLMEvalKit.

Why: The context is different. In the LLaVA project, "GPT4 score" means the score was assigned by GPT-4 acting as the judge; in VLMEvalKit, "GPT4 Score" means the score of the reference answers (which were written by GPT-4).
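For reference, here is a minimal sketch of that relationship (not the VLMEvalKit source; it simply matches the numbers quoted later in this thread, and the actual computation lives in VLMEvalKit's LLaVABench evaluation code):

```python
# Minimal sketch: the Relative Score expresses the VLM's judge-assigned score
# as a percentage of the judge-assigned score of the GPT-4 reference answers.
def relative_score(vlm_score: float, gpt4_score: float) -> float:
    return 100.0 * vlm_score / gpt4_score

# Overall split from the run reported below: 100 * 52.2 / 78.3 ~= 66.7
# (the table shows 66.6; the small difference is just rounding of the displayed averages).
print(round(relative_score(52.2, 78.3), 1))
```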

@justinphan3110
Author

Oh, so to clarify: for example, in this LLaVA-1.6 table, the LLaVA authors show that LLaVA-1.6-Mistral-7B gets 83 on LLaVABench, but this is the score I got from running VLMEvalKit:

| split | Relative Score (main) | VLM Score | GPT4 Score |
| --- | --- | --- | --- |
| overall | 66.6 | 52.2 | 78.3 |
| conv | 57.9 | 49.4 | 85.3 |
| complex | 76.2 | 58.2 | 76.4 |
| detail | 59.5 | 44.0 | 74.0 |

This is the cmd that I used:

model="llava_next_mistral_7b"
task="LLaVABench"

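# model, task and num_gpus are notebook (Python) variables that IPython substitutes into the shell command via { } expansion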
!torchrun --nproc-per-node={num_gpus} run.py --data {task} --model {model} --nproc 20

Is this big gap in scores (66.6 and 83) expected?

@kennymckormick
Member

> Is this big gap in scores (66.6 and 83) expected?

Did the authors mention which version of GPT-4 they used for evaluating LLaVA-1.6?
The original LLaVABench adopts GPT-4-0314 (which is no longer available to new users), while VLMEvalKit adopts GPT-4-1106. The different GPT-4 versions may lead to significant differences in the final score.
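As a rough worked example of why the judge version matters (the first pair of numbers comes from the run above; the rest is plain arithmetic, not a measured result):

```python
# The Relative Score is a ratio of two judge-assigned scores, so changing the judge
# (e.g. GPT-4-0314/0613 vs GPT-4-1106) can move it even if the model's answers are identical.
vlm, ref = 52.2, 78.3    # overall split from the VLMEvalKit run above (GPT-4-1106 as judge)
print(100 * vlm / ref)   # ~66.7

# Arithmetic only: with the same reference score, an overall Relative Score of ~83 would
# require the judge to rate the model's answers at about 0.83 * 78.3 ~= 65.
print(0.83 * ref)        # ~65.0
```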

@justinphan3110
Author

I see. Changing the judge to GPT-4-0613 also brought the score back to ~80.0. Thanks for the clarification.
