Question about LlavaBench Evaluation #194
Comments
Hi @justinphan3110,
TL;DR: The GPT4 score reported by the original LLaVA project is the Relative Score reported in VLMEvalKit.
Why: The two names mean different things. In the LLaVA project, "GPT4 score" means the score was evaluated by GPT-4; in VLMEvalKit, "GPT4 Score" means the score of the reference answer (which was written by GPT-4).
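For anyone else landing here, the relationship is roughly what the sketch below shows. This is only an illustration of the convention described above (a GPT judge scores each question for both the candidate VLM answer and the GPT-4 reference answer, and the Relative Score is the ratio of the two averages); the function and variable names are made up for the example and are not taken from the VLMEvalKit code.

```python
# Minimal sketch of a LLaVA-Bench-style Relative Score, assuming the common
# convention: the judge assigns a per-question score to the candidate (VLM)
# answer and to the GPT-4 reference answer, and the Relative Score is the
# ratio of the two averages expressed as a percentage.
# Names here are illustrative, not VLMEvalKit internals.

from statistics import mean

def relative_score(vlm_scores: list[float], gpt4_scores: list[float]) -> float:
    """Relative Score = 100 * mean(candidate scores) / mean(reference scores)."""
    assert vlm_scores and len(vlm_scores) == len(gpt4_scores), "need paired, non-empty scores"
    return 100.0 * mean(vlm_scores) / mean(gpt4_scores)

# Toy numbers only: candidate averages ~6.6/10, reference ~8.0/10.
vlm_scores = [7.0, 6.5, 6.0, 6.9]
gpt4_scores = [8.0, 8.2, 7.8, 8.1]
print(f"VLM Score      : {mean(vlm_scores):.2f}")
print(f"GPT4 Score     : {mean(gpt4_scores):.2f}")
print(f"Relative Score : {relative_score(vlm_scores, gpt4_scores):.1f}")
```

So the single number the LLaVA paper/table reports corresponds to the Relative Score column, while the VLM Score and GPT4 Score columns are the two averages it is computed from.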
Oh, so to clarify: for example, in this LLaVA-1.6 table, the LLaVA authors show that LLaVA-1.6-Mistral-7B got 83.
This is the cmd that I used:
Is this big gap in scores (66.6 vs 83) expected?
Did the authors mention which version of GPT-4 they used for evaluating LLaVA-1.6?
I see, changing to
Hi author,
I saw that there are 3 columns for LLaVABench scores (Relative Score, VLM Score, GPT4 Score), and it seems that in the evaluation code the Relative Score is calculated from the VLM Score and the GPT4 Score. However, the original LLaVA project appears to report only a GPT4 score (this table). So how are the VLM Score and Relative Score calculated and reported? Or am I missing something?