Results is weird for Qwen2-1.5B #1944
Comments
Hi! Not sure when I might have time to dig deeper on this specifically, but it might be the case that Qwen uses MMLU-style prompting for ARC, a YAML config for which can be found in the Appendices of our recent paper: https://arxiv.org/abs/2405.14782
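For anyone following along, here is a rough sketch of what "MMLU-style prompting" for an ARC question could look like: the answer options are listed with letters in the prompt and the model is scored on the letter. The helper name `mmlu_style_prompt` and the exact template are assumptions made for illustration; the actual config in the paper's appendix may differ.

```python
def mmlu_style_prompt(question, choices):
    """Format a multiple-choice question with lettered options (hypothetical template)."""
    letters = "ABCDEFGH"
    lines = [question.strip()]
    # One "A. option" line per choice, in the order given.
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    # The model is expected to continue with a single letter.
    lines.append("Answer:")
    return "\n".join(lines)


prompt = mmlu_style_prompt(
    "Which gas do plants absorb from the air?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
)
print(prompt)
```

By contrast, a cloze-style prompt shows only the question and scores each full answer text as a continuation, so a model tuned for one format can score very differently on the other.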
Thanks for your reply. But the result for Qwen1.5-1.8B looks good: its acc_norm is 0.5901 on arc_easy. I am just wondering why Qwen2's result looks like random guessing. Maybe it is a prompt issue (special tokens, etc.).
Ah, I see, thanks for reporting. We have seen some unique quirks / edge cases pop up for previous Qwen models; e.g., the tokenizer for Qwen1 does not quite follow the typical HF tokenizers interface. Would definitely recommend checking tokenization if you have the bandwidth to investigate this. How is the measured MMLU performance of this model? Does it match their reported results, or does MMLU also seem lower than anticipated when you eval it in the Harness?
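A minimal sketch of the kind of tokenization check suggested above, assuming the tokenizer exposes the usual HF `encode(text, add_special_tokens=...)` and `decode(ids)` methods. The helper name `tokenization_report` and the fields it returns are hypothetical, chosen only for illustration; it is not part of the Harness.

```python
def tokenization_report(tokenizer, context, continuation):
    """Compare joint vs. separate tokenization of a prompt and its answer (illustrative helper)."""
    whole = tokenizer.encode(context + continuation, add_special_tokens=False)
    ctx = tokenizer.encode(context, add_special_tokens=False)
    cont = tokenizer.encode(continuation, add_special_tokens=False)
    with_special = tokenizer.encode(context, add_special_tokens=True)
    return {
        # Does decoding the ids give back the original string?
        "roundtrip_ok": tokenizer.decode(whole) == context + continuation,
        # Do the pieces tokenize the same way as the full string?
        "split_matches_joint": ctx + cont == whole,
        # How many special tokens (BOS/EOS etc.) get added by default?
        "special_tokens_added": len(with_special) - len(ctx),
    }


# Usage with a real model (downloads the tokenizer, so left commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
# print(tokenization_report(tok, "Question: 2+2?\nAnswer:", " 4"))
```

Comparing the report for Qwen1.5 and Qwen2 tokenizers on the same prompt could help show whether special tokens or boundary tokenization differ between the two.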
The score of MMLU is also quite low.
I tried running some benchmarks on Qwen2-1.5B, but the scores are too low: