
Question about LlavaBench Evaluation #194

Closed
justinphan3110 opened this issue May 9, 2024 · 4 comments

Comments

@justinphan3110

Hi author,

I saw that there are 3 columns for LLaVABench scores (Relative Score, VLM Score, GPT4 Score), and it seems the evaluation code calculates the Relative Score from the VLM Score and the GPT4 Score. However, the original LLaVA project only seems to report the GPT4 Score (this table). So how are the VLM Score and Relative Score calculated and reported? Or am I missing something?

@kennymckormick
Member

Hi @justinphan3110,

TL;DR: The GPT4 score reported by the original LLaVA project corresponds to the Relative Score reported in VLMEvalKit.

Why: The context is different. In the LLaVA project, "GPT4 score" means the score was assigned by GPT-4 acting as the judge; in VLMEvalKit, "GPT4 Score" means the score of the reference answers (which were written by GPT-4).
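For reference, here is a minimal sketch of that relationship (not the VLMEvalKit source; it simply matches the numbers quoted later in this thread, and the actual computation lives in VLMEvalKit's LLaVABench evaluation code):

```python
# Minimal sketch: the Relative Score expresses the VLM's judge-assigned score
# as a percentage of the judge-assigned score of the GPT-4 reference answers.
def relative_score(vlm_score: float, gpt4_score: float) -> float:
    return 100.0 * vlm_score / gpt4_score

# Overall split from the run reported below: 100 * 52.2 / 78.3 ~= 66.7
# (the table shows 66.6; the small difference is just rounding of the displayed averages).
print(round(relative_score(52.2, 78.3), 1))
```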

@justinphan3110
Author

Oh, so to clarify: for example, in this LLaVA-1.6 table, the LLaVA authors show that LLaVA-1.6-Mistral-7B gets 83 on LLaVABench, but this is the score I got from running VLMEvalKit:

| split | Relative Score (main) | VLM Score | GPT4 Score |
| --- | --- | --- | --- |
| overall | 66.6 | 52.2 | 78.3 |
| conv | 57.9 | 49.4 | 85.3 |
| complex | 76.2 | 58.2 | 76.4 |
| detail | 59.5 | 44.0 | 74.0 |

This is the cmd that I used:

model="llava_next_mistral_7b"
task="LLaVABench"

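# model, task and num_gpus are notebook (Python) variables that IPython substitutes into the shell command via { } expansion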
!torchrun --nproc-per-node={num_gpus} run.py --data {task} --model {model} --nproc 20

Is this big gap in scores (66.6 and 83) expected?

@kennymckormick
Member

> Is this big gap in scores (66.6 and 83) expected?

Did the authors mention which version of GPT-4 they used for evaluating LLaVA-1.6?
The original LLaVABench adopts GPT-4-0314 (which is no longer available to new users), while VLMEvalKit adopts GPT-4-1106. The different GPT-4 versions may lead to significant differences in the final score.
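As a rough worked example of why the judge version matters (the first pair of numbers comes from the run above; the rest is plain arithmetic, not a measured result):

```python
# The Relative Score is a ratio of two judge-assigned scores, so changing the judge
# (e.g. GPT-4-0314/0613 vs GPT-4-1106) can move it even if the model's answers are identical.
vlm, ref = 52.2, 78.3    # overall split from the VLMEvalKit run above (GPT-4-1106 as judge)
print(100 * vlm / ref)   # ~66.7

# Arithmetic only: with the same reference score, an overall Relative Score of ~83 would
# require the judge to rate the model's answers at about 0.83 * 78.3 ~= 65.
print(0.83 * ref)        # ~65.0
```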

@justinphan3110
Author

I see. Changing the judge to GPT-4-0613 also brought the score back to ~80.0. Thanks for the clarification.
