
A major problem with the multiple-choice evaluation #141

Closed
YongLD opened this issue Apr 4, 2024 · 7 comments

Comments


YongLD commented Apr 4, 2024

There is a major problem with the multiple-choice evaluation.
I am testing MMBench-DEV-EN here, using the result file generated by the LLaVA framework (llava_MMBench_DEV_EN.xlsx). The score reported by your evaluation is 0.68.

Since every prediction is a single word (just the option letter), I tried matching it myself with a simple check, if item['prediction'] == item['answer'], and the final result is 0.77. So either your evaluation standard is seriously wrong, or I missed something; please let me know.

If you want the result file for testing, I can send it to you, or you can have a check yourself.

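For reference, a minimal sketch of the matching I did (assuming the result xlsx has prediction and answer columns holding single option letters):

    # Minimal sketch of the naive per-item matching described above (not the
    # VLMEvalKit code); column names 'prediction' and 'answer' are assumptions.
    import pandas as pd

    data = pd.read_excel('llava_MMBench_DEV_EN.xlsx')
    hit = (data['prediction'].astype(str).str.strip().str.upper()
           == data['answer'].astype(str).str.strip().str.upper())
    print(f'naive per-item accuracy: {hit.mean():.4f}')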


YongLD commented Apr 4, 2024

The evaluation code is quite complex, while the results generated by the model are often simple (just a single option letter).

@kennymckormick (Member) commented

Would you please upload the Excel file and share it with me so that I can have a check?
Besides, please note that we use CircularEval for MMBench, so the final accuracy is not simply the average over all atomic tests. In CircularEval, you fail a question if you fail any of its 4 circular passes.
To learn more about CircularEval, you can refer to the original MMBench paper: https://arxiv.org/abs/2307.06281
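For illustration, a minimal sketch of how the circular passes are aggregated (the group column tying the passes of one question together and the other column names are assumptions for this sketch, not the exact implementation):

    # Minimal sketch of CircularEval aggregation: a question only counts as
    # correct if every one of its circular passes is answered correctly.
    # Column names ('prediction', 'answer', 'group') are assumptions here.
    import pandas as pd

    def circular_accuracy(df: pd.DataFrame) -> float:
        hit = (df['prediction'].str.strip().str.upper()
               == df['answer'].str.strip().str.upper())
        # .all() within each group: failing any pass fails the whole question
        return hit.groupby(df['group']).all().mean()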


YongLD commented Apr 8, 2024

I understand, but if you save the responses directly in the xlsx file and then evaluate the results based on that xlsx file, my way of checking should still be right.
llava_v1.5_13b_MMBench_DEV_EN_Base_DB2.xlsx


YongLD commented Apr 8, 2024

Has this evaluation method been used on other multiple-choice datasets?

@kennymckormick (Member) commented

Has this evaluation method been used on other multiple-choice datasets?

No, it's only used for MMBench Series (MMBench, MMBench-CN, CCBench)


YongLD commented Apr 8, 2024

No, it's only used for MMBench Series (MMBench, MMBench-CN, CCBench)

I see. But after looking at the source code, I still don't quite understand where you perform this operation. The only MMBench-specific difference I can find is the following, in evaluate/multiple_choice.py:

    if listinstr(['mmbench', 'ccbench'], dataset.lower()):
        data = load(eval_file)
        data['index'] = [int(x) for x in data['index']]
        dump(data, eval_file)

Where do you distinguish the evaluation scheme for MMBench from that of other datasets?
Has the index already been processed in the downloaded dataset?
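If I understand the paper correctly, I guess the circular copies are distinguished by index offsets, something like the sketch below (the offset of 1,000,000 is my assumption, not something I verified in the code):

    # Guess at how circular copies map back to their base question, assuming
    # each copy's index is the base index plus k * 1_000_000 (an assumption,
    # not verified against the actual code).
    OFFSET = 1_000_000

    def base_index(index: int) -> int:
        return index % OFFSET

    # e.g. indices 42, 1000042, 2000042, 3000042 would all map to question 42
    assert {base_index(i) for i in (42, 1000042, 2000042, 3000042)} == {42}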


LightDXY commented Apr 8, 2024

A simple if item['prediction'] == item['answer'] does not account for the circular (rotated) evaluation design of the MMBench series; maybe you should read the paper carefully before challenging the correctness of the code.

YongLD closed this as completed Apr 9, 2024