Error when evaluating mmbench_dev_en: The correct answer according to the 'answer' field in the table should be D, but the log says it is A. #73

Closed
jdy18 opened this issue Jan 26, 2024 · 5 comments

Comments

@jdy18

jdy18 commented Jan 26, 2024

image

The correct answer according to the 'answer' field in the table should be D, but the log says it is A.

@kennymckormick
Member

Hi, @jdy18 ,
That's because we use Circular Evaluation for MMBench. During evaluation, four rolling passes (indexed from 0) are evaluated, and the choices are shifted in each pass. Suppose the original choice list (rolling pass 0) is: A. answer a; B. answer b; C. answer c; D. answer d. In rolling pass 1, the choice list will be: A. answer d; B. answer a; C. answer b; D. answer c. That's why the log shows a different answer from the 'answer' field (the logged sample is from rolling pass 1). You can also check the choice content shown in the log.
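
To make the shift concrete, here is a minimal sketch of one rolling pass. It is not the toolkit's actual code: the function name `rolling_pass` is hypothetical, and the assumption that pass k rotates the choices by k positions is extrapolated from the pass 1 example above.

```python
import string


def rolling_pass(choices, answer, k):
    """Rotate the choice list right by k positions (hypothetical helper).

    Returns the shifted choice list and the letter under which the
    original correct answer now appears.
    """
    n = len(choices)
    letters = string.ascii_uppercase[:n]
    k %= n
    # The last k choices move to the front; the rest slide down.
    shifted = choices[n - k:] + choices[:n - k]
    # The correct answer moves k letters forward, wrapping around.
    new_answer = letters[(letters.index(answer) + k) % n]
    return shifted, new_answer


choices = ["answer a", "answer b", "answer c", "answer d"]
shifted, new_answer = rolling_pass(choices, "D", 1)
print(shifted)     # ['answer d', 'answer a', 'answer b', 'answer c']
print(new_answer)  # A
```

With the table answer D, pass 1 places "answer d" under letter A, which matches what the log reports.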

@jdy18
Author

jdy18 commented Jan 28, 2024


Thank you for your attention. I have also encountered an issue when evaluating MME: a response is incorrectly interpreted as "unknown" if it contains a word like "notice", even when the response is clearly affirmative, such as 'Yes, the image is a photo of Friedhof Wilmersdorf. The photograph depicts a gravestone adorned with a death notice and an emblem.' The error arises from the presence of "no" within "notice".
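
A minimal sketch of the failure mode, assuming the evaluator uses plain substring checks (I have not confirmed this against the actual code, and both function names here are hypothetical). Matching whole words instead of substrings avoids the false "no":

```python
import re


def naive_yes_no(response):
    """Assumed buggy behaviour: substring checks for 'yes' and 'no'."""
    text = response.lower()
    has_yes = "yes" in text
    has_no = "no" in text  # also true for 'notice', 'not', 'know', ...
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return "unknown"


def word_yes_no(response):
    """Same logic, but only whole words count as 'yes' or 'no'."""
    words = re.findall(r"[a-z]+", response.lower())
    has_yes = "yes" in words
    has_no = "no" in words
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return "unknown"


response = ("Yes, the image is a photo of Friedhof Wilmersdorf. The photograph "
            "depicts a gravestone adorned with a death notice and an emblem.")
print(naive_yes_no(response))  # unknown
print(word_yes_no(response))   # yes
```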

@kennymckormick
Member

Hi, @jdy18 ,
Yeah, this is a problem in our evaluation and we will create a new PR to fix it.

@kennymckormick
Member

@jdy18 , The second problem you specified has been fixed in #81

@jdy18
Author

jdy18 commented Feb 10, 2024


Thank you for your prompt attention and efforts to address the issues I've raised. Your responsiveness and dedication to improving the system are truly appreciated.

However, I'd like to suggest a couple of enhancements to further refine the evaluation process for multi-choice tasks:

It might be beneficial to match option letters exactly and in uppercase only in multi-choice questions. This would avoid confusion caused by words like the article "a" in a response, which might otherwise be read as option A and make the response look like it selects multiple choices. Additionally, when multiple letters or multiple instances of "yes"/"no" appear, the system could give priority to the first one in the sentence when determining the intended response.
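
A rough sketch of what I have in mind, not a proposal for the exact implementation. The helper names `extract_option` and `first_yes_no` are hypothetical, and the four-option default is an assumption:

```python
import re


def extract_option(response, options="ABCD"):
    """Return the first standalone uppercase option letter, or None.

    Lowercase words such as the article 'a' are ignored because the
    match is case-sensitive and requires word boundaries.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def first_yes_no(response):
    """Return the first whole-word 'yes' or 'no' in the response, or None."""
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in ("yes", "no"):
            return word
    return None


print(extract_option("The answer is B, because a cat is visible."))  # B
print(extract_option("It shows a cat."))                             # None
print(first_yes_no("Yes, there is no dog in the photo."))            # yes
```

One known limitation: a sentence that starts with the capitalized article "A" (for example "A dog is visible") would still be read as option A, so some extra handling would be needed for that case.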

I am also curious about whether the scores currently displayed on the OpenCompass leaderboard have been updated to reflect these latest modifications. Could you provide any information on this?
