
Questions regarding the metrics for SEED bench #20

Closed
zjysteven opened this issue Dec 20, 2023 · 7 comments

zjysteven commented Dec 20, 2023

Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and SEED-Bench, and I have some confusion about the metrics in the SEED leaderboard; any input would be appreciated.

[screenshot: SEED leaderboard metrics table]

Specifically, I have three questions.

  1. What does "heuristic matching" mean in ExactMatchRate?
  2. I don't fully understand the definitions of MatchedAcc and ExactMatchAcc (or the difference between them). Would you mind explaining them with a concrete example?
  3. It is mentioned for the official SEED leaderboard that "For models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant." I understand what PPL-based evaluation means (ranking options by perplexity), but what does generation-based evaluation mean here?

Thank you in advance for your help.

@kennymckormick (Member)

  1. 'Heuristic matching' means we try to match the prediction to one of the options using a pre-defined procedure: we try to match either the option label (A, B, C, D) or the option content against the VLM prediction (a simplified sketch of this matching, together with the metric computation from point 2, follows right after this list). In the codebase, the matching logic lives in:

    def can_infer(answer, choices):

  2. Here is an example with 3 samples:

    1. Answer: A; Prediction: blahblah
    2. Answer: B; Prediction: B
    3. Answer: C; Prediction: A

    Among the three samples, samples 2 and 3 are matched (the prediction can be mapped to an option), while sample 1 is not. Only sample 2 is both matched and correct, so:

    1. MatchedAcc = 1 correct / 2 matched samples = 50%
    2. ExactMatchAcc = 1 correct / 3 total samples = 33%
  3. Generation-based evaluation is the one used in VLMEvalKit: we feed the question and options to the VLM and compute accuracy from the VLM-generated outputs, whereas PPL-based evaluation ranks the options by perplexity without asking the model to generate anything.
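
For readers who prefer code, here is a minimal, self-contained sketch of the matching and metric computation from points 1 and 2. It is not the actual `can_infer` implementation from VLMEvalKit; the matching rules and the choice texts below are illustrative assumptions.

    import string

    def heuristic_match(prediction, choices):
        # Simplified stand-in for VLMEvalKit's `can_infer`; the exact rules in the
        # toolkit differ, this only illustrates the idea of heuristic matching.
        pred = prediction.strip()
        labels = list(string.ascii_uppercase[:len(choices)])
        # 1. Try to match an option label (e.g. the model answered "B" or "B.").
        for label in labels:
            if pred == label or pred.startswith(label + ".") or pred.startswith(label + ")"):
                return label
        # 2. Fall back to matching option content (the model repeated an option's text).
        for label, content in zip(labels, choices):
            if content.lower() in pred.lower():
                return label
        return None  # nothing matched -> the sample counts as "not matched"

    # The three-sample example above; the choice texts are made up for illustration.
    choices = ["red", "green", "blue", "yellow"]
    samples = [
        {"answer": "A", "prediction": "blahblah"},
        {"answer": "B", "prediction": "B"},
        {"answer": "C", "prediction": "A"},
    ]

    matched = [heuristic_match(s["prediction"], choices) for s in samples]
    n_matched = sum(m is not None for m in matched)                      # 2 (samples 2 and 3)
    n_correct = sum(m == s["answer"] for s, m in zip(samples, matched))  # 1 (sample 2 only)
    print("MatchedAcc:   ", n_correct / n_matched)     # 1 / 2 = 50%
    print("ExactMatchAcc:", n_correct / len(samples))  # 1 / 3 ≈ 33%

For point 3, a hedged sketch of the PPL-based side of the comparison: the model is never asked to generate an answer; each option is scored by the (length-normalised) perplexity the model assigns to it given the question, and the lowest-perplexity option is taken as the prediction. The `nll_fn` scoring hook is a hypothetical stand-in, not a VLMEvalKit API.

    import math

    def ppl_rank(question, choices, nll_fn):
        # `nll_fn(prompt, continuation)` is a hypothetical scoring hook assumed to
        # return the total negative log-likelihood of `continuation` given `prompt`
        # (and, for a VLM, the image).
        best_label, best_ppl = None, math.inf
        for label, choice in zip("ABCD", choices):
            nll = nll_fn(question, choice)
            ppl = math.exp(nll / max(len(choice.split()), 1))  # length-normalised perplexity
            if ppl < best_ppl:
                best_label, best_ppl = label, ppl
        return best_label  # the prediction is always one of the given options

Because a PPL-based prediction is always one of the given options, it never fails to match, which is one reason the two protocols can diverge for models with limited instruction-following ability.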

@zjysteven (Author)

@kennymckormick Thank you for the reply; it's super helpful. One last follow-up question:
[screenshot: SEEDBench_IMG evaluation results]
Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

@zjysteven (Author)

Another question here (I'm new to this field; need to learn a lot...)

I evaluated on the scene understanding category only and got 73.115896 accuracy, which is not exactly the same as the 73.527549 I got in the screenshot above (where I evaluated on all categories). Is this expected, or did I do something wrong? I thought the results should be consistent across different evaluation runs.

In case this information helps: I created a new tsv file, SEEDBench_IMG_SceneUnderstanding.tsv, from the original SEEDBench_IMG.tsv by keeping only the questions whose category is scene understanding, and then ran the evaluation on it with the provided eval toolkit.
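
The filtering step described above could look roughly like this (a sketch assuming pandas; the category column name and label spelling in SEEDBench_IMG.tsv are assumptions and may differ):

    import pandas as pd

    # Column name ("category") and label spelling ("Scene Understanding") are
    # assumptions; check the actual SEEDBench_IMG.tsv header before running.
    df = pd.read_csv("SEEDBench_IMG.tsv", sep="\t")
    subset = df[df["category"] == "Scene Understanding"]
    subset.to_csv("SEEDBench_IMG_SceneUnderstanding.tsv", sep="\t", index=False)
    print(f"Kept {len(subset)} of {len(df)} questions")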

@kennymckormick (Member)

> Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

In this table, the acc here is MatchedAcc.

@kennymckormick (Member)

> I evaluated on the scene understanding category only and got 73.115896 accuracy, which is not exactly the same as the 73.527549 I got when evaluating on all categories. Is this expected, or did I do something wrong?

I'm not entirely sure, but it sounds like there is some randomness in your model during evaluation. You can run the evaluation again to check whether the accuracy number changes further.

@zjysteven (Author)

I see. Thank you again for these replies.

kamuyix commented Jun 14, 2024

> Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

Hi, could you tell me how you evaluated your result? There was no response when I clicked refresh after submitting to the SEED-Bench leaderboard.
