
Questions regarding the metrics for SEED bench #20

Closed
zjysteven opened this issue Dec 20, 2023 · 7 comments

zjysteven commented Dec 20, 2023

Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and SEED-Bench, and I have some confusion about the metrics in the SEED leaderboard; any input would be appreciated.

[screenshot: SEED leaderboard metrics table]

Specifically, I have three questions.

  1. What does "heuristic matching" mean in ExactMatchRate?
  2. I don't fully understand the definitions of MatchedAcc and ExactMatchAcc (or the difference between them). Would you mind explaining them with a concrete example?
  3. It is mentioned for the official SEED leaderboard that "For models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant." I understand what PPL-based evaluation means (ranking options by perplexity), but what does generation-based evaluation mean here?

Thank you in advance for your help.

@kennymckormick (Member)

  1. 'Heuristic matching' means we try to match the prediction to one of the options using a pre-defined procedure: we try to match either the option label (A, B, C, D) or the option content against the VLM prediction (a simplified sketch of this matching, together with the metric computation from point 2, follows right after this list). In the codebase, the matching logic lives in:

    def can_infer(answer, choices):

  2. Here is an example with 3 samples:

    1. Answer: A; Prediction: blahblah
    2. Answer: B; Prediction: B
    3. Answer: C; Prediction: A

    Among the three samples, samples 2 and 3 are matched (the prediction can be mapped to an option), while sample 1 is not. Only sample 2 is both matched and correct, so:

    1. MatchedAcc = 1 correct / 2 matched samples = 50%
    2. ExactMatchAcc = 1 correct / 3 total samples = 33%
  3. Generation-based evaluation is the one used in VLMEvalKit: we feed the question and options to the VLM and compute accuracy from the VLM-generated outputs, whereas PPL-based evaluation ranks the options by perplexity without asking the model to generate anything.
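
For readers who prefer code, here is a minimal, self-contained sketch of the matching and metric computation from points 1 and 2. It is not the actual `can_infer` implementation from VLMEvalKit; the matching rules and the choice texts below are illustrative assumptions.

    import string

    def heuristic_match(prediction, choices):
        # Simplified stand-in for VLMEvalKit's `can_infer`; the exact rules in the
        # toolkit differ, this only illustrates the idea of heuristic matching.
        pred = prediction.strip()
        labels = list(string.ascii_uppercase[:len(choices)])
        # 1. Try to match an option label (e.g. the model answered "B" or "B.").
        for label in labels:
            if pred == label or pred.startswith(label + ".") or pred.startswith(label + ")"):
                return label
        # 2. Fall back to matching option content (the model repeated an option's text).
        for label, content in zip(labels, choices):
            if content.lower() in pred.lower():
                return label
        return None  # nothing matched -> the sample counts as "not matched"

    # The three-sample example above; the choice texts are made up for illustration.
    choices = ["red", "green", "blue", "yellow"]
    samples = [
        {"answer": "A", "prediction": "blahblah"},
        {"answer": "B", "prediction": "B"},
        {"answer": "C", "prediction": "A"},
    ]

    matched = [heuristic_match(s["prediction"], choices) for s in samples]
    n_matched = sum(m is not None for m in matched)                      # 2 (samples 2 and 3)
    n_correct = sum(m == s["answer"] for s, m in zip(samples, matched))  # 1 (sample 2 only)
    print("MatchedAcc:   ", n_correct / n_matched)     # 1 / 2 = 50%
    print("ExactMatchAcc:", n_correct / len(samples))  # 1 / 3 ≈ 33%

For point 3, a hedged sketch of the PPL-based side of the comparison: the model is never asked to generate an answer; each option is scored by the (length-normalised) perplexity the model assigns to it given the question, and the lowest-perplexity option is taken as the prediction. The `nll_fn` scoring hook is a hypothetical stand-in, not a VLMEvalKit API.

    import math

    def ppl_rank(question, choices, nll_fn):
        # `nll_fn(prompt, continuation)` is a hypothetical scoring hook assumed to
        # return the total negative log-likelihood of `continuation` given `prompt`
        # (and, for a VLM, the image).
        best_label, best_ppl = None, math.inf
        for label, choice in zip("ABCD", choices):
            nll = nll_fn(question, choice)
            ppl = math.exp(nll / max(len(choice.split()), 1))  # length-normalised perplexity
            if ppl < best_ppl:
                best_label, best_ppl = label, ppl
        return best_label  # the prediction is always one of the given options

Because a PPL-based prediction is always one of the given options, it never fails to match, which is one reason the two protocols can diverge for models with limited instruction-following ability.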

@zjysteven (Author)

@kennymckormick Thank you for the reply; it's super helpful. One last follow-up question:
[screenshot: SEEDBench_IMG evaluation results]
Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

@zjysteven (Author)

Another question here (I'm new to this field; need to learn a lot...)

I evaluated on the scene understanding category only and got 73.115896 accuracy, which is not exactly the same as the 73.527549 I got in the screenshot above (where I evaluated on all categories). Is this expected, or did I do something wrong? I thought the results should be consistent across different evaluation runs.

In case this information helps: I created a new tsv file, SEEDBench_IMG_SceneUnderstanding.tsv, from the original SEEDBench_IMG.tsv by keeping only the questions whose category is scene understanding, and then ran the evaluation on it with the provided eval toolkit.
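
The filtering step described above could look roughly like this (a sketch assuming pandas; the category column name and label spelling in SEEDBench_IMG.tsv are assumptions and may differ):

    import pandas as pd

    # Column name ("category") and label spelling ("Scene Understanding") are
    # assumptions; check the actual SEEDBench_IMG.tsv header before running.
    df = pd.read_csv("SEEDBench_IMG.tsv", sep="\t")
    subset = df[df["category"] == "Scene Understanding"]
    subset.to_csv("SEEDBench_IMG_SceneUnderstanding.tsv", sep="\t", index=False)
    print(f"Kept {len(subset)} of {len(df)} questions")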

@kennymckormick (Member)

> Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

In this table, the acc here is MatchedAcc.

@kennymckormick (Member)

> I evaluated on the scene understanding category only and got 73.115896 accuracy, which is not exactly the same as the 73.527549 I got when evaluating on all categories. Is this expected, or did I do something wrong?

I'm not entirely sure, but it sounds like there is some randomness in your model during evaluation. You can run the evaluation again to check whether the accuracy number changes further.

@zjysteven (Author)

I see. Thank you again for these replies.

kamuyix commented Jun 14, 2024

> Here is the result I got by evaluating on SEEDBench_IMG. Is the acc here LLMMatchAcc or ExactMatchAcc?

Hi, could you tell me how you evaluated your result? There was no response when I clicked refresh after submitting to the SEED-Bench leaderboard.
