
Results are weird for Qwen2-1.5B #1944

Open
SefaZeng opened this issue Jun 11, 2024 · 4 comments

@SefaZeng

I tried to evaluate Qwen2-1.5B on some benchmarks, but the scores are too low:

| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |0.2647|±  |0.0091|
|        |       |none  |     0|acc_norm|0.2597|±  |0.0090|

|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.5107|±  | 0.014|
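
For reference, a minimal sketch of how a run like this could be reproduced with the Harness's Python API (the exact invocation, dtype, and batch size are assumptions, not taken from this report):

```python
# Hypothetical reproduction of the zero-shot arc_easy / winogrande run above,
# using lm-evaluation-harness's Python entry point.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                               # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",   # dtype is an assumption
    tasks=["arc_easy", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics (acc, acc_norm, stderr)
```
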
@haileyschoelkopf
Contributor

Hi! Not sure when I might have time to dig deeper on this specifically, but it might be the case that Qwen uses MMLU-style prompting for ARC, a YAML config for which can be found in the appendices of our recent paper https://arxiv.org/abs/2405.14782.
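
To illustrate the difference (a hypothetical sketch, not the Harness's actual templates or the exact config from the paper appendix): the default ARC task scores each full answer string as the continuation, whereas an MMLU-style task lists lettered options in the prompt and scores only the answer letter.

```python
# Illustrative only: how the two prompt styles differ for one ARC-like question.
question = "Which gas do plants absorb from the atmosphere?"
choices = ["oxygen", "carbon dioxide", "nitrogen", "helium"]

# Default ARC-style requests: one (context, continuation) pair per answer string.
default_requests = [(f"Question: {question}\nAnswer:", f" {c}") for c in choices]

# MMLU-style requests: options appear in the prompt; continuations are letters.
letters = ["A", "B", "C", "D"]
options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
mmlu_style_requests = [(f"{question}\n{options}\nAnswer:", f" {l}") for l in letters]
```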

@SefaZeng
Author

Thanks for your reply. But the result for Qwen1.5-1.8B looks fine, with an acc_norm of 0.5901 on arc_easy. I am just wondering why Qwen2's results look like random guessing; maybe it is a prompt issue (special tokens, etc.).

@haileyschoelkopf
Contributor

Ah, I see, thanks for reporting.

We have seen some unique quirks / edge cases pop up for previous Qwen models; e.g., the tokenizer for Qwen1 does not quite follow the typical HF tokenizers interface. Would definitely recommend checking tokenization if you have the bandwidth to investigate this.

How is the measured MMLU performance of this model? Does it match their reported results, or does MMLU also seem lower than anticipated when you eval it in the Harness?
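
One quick, hypothetical way to sanity-check tokenization (the prompt strings below are made up; the Harness's real boundary handling has a few extra cases):

```python
# Check how a context/continuation pair tokenizes across the boundary,
# roughly mirroring how the Harness splits off the continuation tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
context = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
continuation = " carbon dioxide"

ctx_ids = tok.encode(context)
whole_ids = tok.encode(context + continuation)

# Tokens that would be scored for this continuation; if merges happen across
# the boundary, the scored span (and thus the loglikelihood) can be off.
print(tok.convert_ids_to_tokens(whole_ids[len(ctx_ids):]))
# Also worth checking that no unexpected special tokens are prepended/appended.
print(tok.decode(whole_ids) == context + continuation)
```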

@SefaZeng
Author

The MMLU scores are also quite low:

|      Groups       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|------:|------|-----:|------|-----:|---|-----:|
|mmlu               |    N/A|none  |     0|acc   |0.2295|±  |0.0035|
| - humanities      |    N/A|none  |     5|acc   |0.2421|±  |0.0062|
| - other           |    N/A|none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences |    N/A|none  |     5|acc   |0.2171|±  |0.0074|
| - stem            |    N/A|none  |     5|acc   |0.2125|±  |0.0073|
