
A major problem with the multiple-choice evaluation #141

Closed
YongLD opened this issue Apr 4, 2024 · 7 comments

Comments


YongLD commented Apr 4, 2024

There is a major problem with the multiple-choice evaluation.
I am testing MMBench-DEV-EN here, using the result file generated by the LLaVA framework (llava_MMBench_DEV_EN.xlsx). The score reported by your evaluation is 0.68.

Since every prediction is a single word (just the option letter), I tried matching it myself with a simple check, if item['prediction'] == item['answer'], and the final result is 0.77. So either your evaluation standard is seriously wrong, or I missed something; please let me know.

If you want the result file for testing, I can send it to you, or you can have a check yourself.

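For reference, a minimal sketch of the matching I did (assuming the result xlsx has prediction and answer columns holding single option letters):

    # Minimal sketch of the naive per-item matching described above (not the
    # VLMEvalKit code); column names 'prediction' and 'answer' are assumptions.
    import pandas as pd

    data = pd.read_excel('llava_MMBench_DEV_EN.xlsx')
    hit = (data['prediction'].astype(str).str.strip().str.upper()
           == data['answer'].astype(str).str.strip().str.upper())
    print(f'naive per-item accuracy: {hit.mean():.4f}')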


YongLD commented Apr 4, 2024

The evaluation code is quite complex, while the results generated by the model are often simple (just a single option letter).

@kennymckormick (Member) commented

Would you please upload the Excel file and share it with me so that I can have a check?
Besides, please note that we use CircularEval for MMBench, so the final accuracy is not simply the average over all atomic tests. In CircularEval, you fail a question if you fail any of its 4 circular passes.
To learn more about CircularEval, you can refer to the original MMBench paper: https://arxiv.org/abs/2307.06281
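For illustration, a minimal sketch of how the circular passes are aggregated (the group column tying the passes of one question together and the other column names are assumptions for this sketch, not the exact implementation):

    # Minimal sketch of CircularEval aggregation: a question only counts as
    # correct if every one of its circular passes is answered correctly.
    # Column names ('prediction', 'answer', 'group') are assumptions here.
    import pandas as pd

    def circular_accuracy(df: pd.DataFrame) -> float:
        hit = (df['prediction'].str.strip().str.upper()
               == df['answer'].str.strip().str.upper())
        # .all() within each group: failing any pass fails the whole question
        return hit.groupby(df['group']).all().mean()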


YongLD commented Apr 8, 2024

I understand, but if you save the responses directly in the xlsx file and then evaluate the results based on that xlsx file, my way of checking should still be right.
llava_v1.5_13b_MMBench_DEV_EN_Base_DB2.xlsx


YongLD commented Apr 8, 2024

Has this evaluation method been used on other multiple-choice datasets?

@kennymckormick (Member) commented

Has this evaluation method been used on other multiple-choice datasets?

No, it's only used for MMBench Series (MMBench, MMBench-CN, CCBench)


YongLD commented Apr 8, 2024

No, it's only used for MMBench Series (MMBench, MMBench-CN, CCBench)

I see. But after looking at the source code, I still don't quite understand where you perform this operation. The only MMBench-specific difference I can find is the following, in evaluate/multiple_choice.py:

    if listinstr(['mmbench', 'ccbench'], dataset.lower()):
        data = load(eval_file)
        data['index'] = [int(x) for x in data['index']]
        dump(data, eval_file)

Where do you distinguish the evaluation scheme for MMBench from that of other datasets?
Has the index already been processed in the downloaded dataset?
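If I understand the paper correctly, I guess the circular copies are distinguished by index offsets, something like the sketch below (the offset of 1,000,000 is my assumption, not something I verified in the code):

    # Guess at how circular copies map back to their base question, assuming
    # each copy's index is the base index plus k * 1_000_000 (an assumption,
    # not verified against the actual code).
    OFFSET = 1_000_000

    def base_index(index: int) -> int:
        return index % OFFSET

    # e.g. indices 42, 1000042, 2000042, 3000042 would all map to question 42
    assert {base_index(i) for i in (42, 1000042, 2000042, 3000042)} == {42}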


LightDXY commented Apr 8, 2024

A simple if item['prediction'] == item['answer'] does not account for the circular (rotated) evaluation design of the MMBench series; maybe you should read the paper carefully before challenging the correctness of the code.

YongLD closed this as completed Apr 9, 2024