Questions regarding the metrics for SEED bench #20
Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and SEED-Bench, and I have some confusion regarding the metrics in the SEED leaderboard; I would appreciate any input.

Specifically, I have three questions:

1. What does `ExactMatchRate` mean?
2. What are `MatchedAcc` and `ExactMatchAcc` (and the difference between them)? Would you mind explaining it with a concrete example?
3. The leaderboard notes that "for models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant." I understand what PPL-based evaluation means (ranking the options by perplexity), but what does generation-based evaluation mean here?

Thank you in advance for your help.
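For concreteness, here is a minimal sketch of what "ranking the options by perplexity" can look like, using a HuggingFace-style text-only causal LM as a stand-in (SEED-Bench models are multimodal, and this is not VLMEvalKit's actual implementation; all names below are illustrative). Generation-based evaluation, by contrast, typically means letting the model generate free-form text and then matching that text against the candidate options.

```python
# Sketch of PPL-based multiple-choice evaluation: score each option by the
# perplexity the model assigns to it, conditioned on the question, and pick
# the lowest-perplexity option. Illustrative only; "gpt2" is a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_ppl(question: str, option: str) -> float:
    """Perplexity of the option's tokens, conditioned on the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift targets by one position).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over the option's tokens, not the shared question prefix
    # (assumes the question tokenizes to the same prefix in both encodings).
    n_prompt = prompt_ids.shape[1]
    option_lp = token_lp[:, n_prompt - 1:]
    return torch.exp(-option_lp.mean()).item()

question = "Question: What animal is in the picture? Answer:"
options = ["A. cat", "B. dog", "C. horse", "D. bird"]
prediction = min(options, key=lambda opt: option_ppl(question, opt))
print(prediction)  # the lowest-perplexity option is the model's choice
```

The key design choice is averaging log-probs over the option tokens only, so that options of different lengths remain comparable; note that this procedure never requires the model to follow instructions, which is why it behaves so differently from generation-based evaluation for weak instruction followers.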
Comments

@kennymckormick Thank you for the reply; it's super helpful. One last follow-up question: …
Another question here (I'm new to this field; need to learn a lot...). I was evaluating on … In case this information helps, I was creating a new tsv file …
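Since "creating a new tsv file" comes up here, below is a minimal sketch of how such a file might be assembled with pandas. The column schema (index, question, A–D, answer, image as base64) is an assumption for illustration, not a confirmed spec; the eval tool's docs define the real format, and the image path is hypothetical.

```python
# Minimal sketch of building a custom multiple-choice TSV with pandas.
# The column names below are an ASSUMED schema, not the tool's confirmed one.
import base64
import pandas as pd

def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be stored inline in the TSV."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

records = [
    {
        "index": 0,
        "question": "What animal is shown in the image?",
        "A": "cat",
        "B": "dog",
        "C": "horse",
        "D": "bird",
        "answer": "B",
        "image": encode_image("example.jpg"),  # hypothetical image path
    },
]

pd.DataFrame(records).to_csv("my_benchmark.tsv", sep="\t", index=False)
```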
In this table, the …
I'm not sure, but it sounds like there is some randomness in your model during evaluation. You can run the evaluation again to check whether the accuracy number changes further.
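One quick way to test this randomness hypothesis is a sketch like the one below: run the same prompt twice with greedy decoding and compare the outputs ("gpt2" is a stand-in for the model actually being evaluated; this is not part of the eval tool itself).

```python
# Minimal determinism check: with greedy decoding (do_sample=False), two runs
# on the same input should produce identical outputs. If eval scores still
# move between full runs, the variation comes from somewhere else (e.g.
# sampling being enabled, or non-deterministic kernels).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Question: is the output stable? Answer:",
                      return_tensors="pt").input_ids
with torch.no_grad():
    first = model.generate(input_ids, do_sample=False, max_new_tokens=16)
    second = model.generate(input_ids, do_sample=False, max_new_tokens=16)

print(torch.equal(first, second))  # expect True under greedy decoding
```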
I see. Thank you again for these replies.
Hi, could you tell me how you evaluated your result? There was no response when I clicked refresh after submitting to the SEED-Bench leaderboard.