A major problem with the multiple-choice evaluation #141
Comments
The evaluation code is quite complex, but sometimes the results generated by the model are simple.
Would you please upload the Excel file and share it with me so that I can have a check?
I understand, but if you save the response directly in the xlsx file and then evaluate the results based on that xlsx file, then my point should still stand.
Has this evaluation method been used on other multiple-choice datasets?
No, it is only used for the MMBench series (MMBench, MMBench-CN, CCBench).
I see. But after looking at the source code, I still don't quite understand where you perform this operation. The only difference I can find is the following:

```python
if listinstr(['mmbench', 'ccbench'], dataset.lower()):
    data = load(eval_file)
    data['index'] = [int(x) for x in data['index']]
    dump(data, eval_file)
```

Where do you distinguish the evaluation scheme of MMBench from that of other datasets?
A simple
There is a major problem with the multiple-choice evaluation.
I am testing MMBench-dev-en here. I used the result file generated by the LLaVA framework, llava_MMBench_DEV_EN.xlsx, and the score reported by your evaluation is 0.68.
Its predictions are all single words, just the option letters, so I tried to match them myself with a simple `if item['prediction'] == item['answer']` check, and the final result I got is 0.77. So either your evaluation standard is seriously wrong, or I missed something; please let me know. If you want the result file to test, I can send it to you, or you can just have a check.
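For concreteness, here is a minimal sketch of the two ways of scoring discussed in this thread: a strict string comparison like the one above, and a slightly more lenient match that extracts an option letter from a longer response. The 'prediction' and 'answer' column names come from the snippet above; reading the file with pandas and assuming the options are letters A-D are my own assumptions, and this is not the repository's actual evaluator.

```python
# Sketch only, not the repo's evaluator: strict exact-match accuracy versus
# a lenient match that pulls the first option letter (A-D) out of the
# prediction, assuming the xlsx has 'prediction' and 'answer' columns.
import re
import pandas as pd

def strict_accuracy(path: str) -> float:
    df = pd.read_excel(path)
    hits = sum(str(p).strip() == str(a).strip()
               for p, a in zip(df['prediction'], df['answer']))
    return hits / len(df)

def lenient_accuracy(path: str) -> float:
    df = pd.read_excel(path)

    def letter(x):
        # First standalone A-D in the text, so "A", "A." and
        # "The answer is A" all map to "A".
        m = re.search(r'\b([A-D])\b', str(x).upper())
        return m.group(1) if m else None

    hits = sum(letter(p) == str(a).strip().upper()
               for p, a in zip(df['prediction'], df['answer']))
    return hits / len(df)

# Example usage (file name taken from this thread):
# print(strict_accuracy('llava_MMBench_DEV_EN.xlsx'))
# print(lenient_accuracy('llava_MMBench_DEV_EN.xlsx'))
```

Note that a lenient match can only raise the score relative to a strict one, so it cannot by itself explain the lower 0.68 reported here; additional steps in the official pipeline (MMBench's protocol, for example, includes a circular evaluation over shuffled option orders) are a more likely source of that gap, which is exactly what this thread is trying to pin down.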