
Results are weird for Qwen2-1.5B #1944

Open
SefaZeng opened this issue Jun 11, 2024 · 4 comments

@SefaZeng

I tried to evaluate Qwen2-1.5B on some benchmarks, but the scores are too low:

| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|--------|------:|------|-----:|--------|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |0.2647|±  |0.0091|
|        |       |none  |     0|acc_norm|0.2597|±  |0.0090|

|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.5107|±  | 0.014|
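
For reference, a minimal sketch of how a run like this could be reproduced with the Harness's Python API (the exact invocation, dtype, and batch size are assumptions, not taken from this report):

```python
# Hypothetical reproduction of the zero-shot arc_easy / winogrande run above,
# using lm-evaluation-harness's Python entry point.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                               # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",   # dtype is an assumption
    tasks=["arc_easy", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics (acc, acc_norm, stderr)
```
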
@haileyschoelkopf
Contributor

Hi! Not sure when I might have time to dig deeper on this specifically, but it might be the case that Qwen uses MMLU-style prompting for ARC, a YAML config for which can be found in the appendices of our recent paper https://arxiv.org/abs/2405.14782.
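
To illustrate the difference (a hypothetical sketch, not the Harness's actual templates or the exact config from the paper appendix): the default ARC task scores each full answer string as the continuation, whereas an MMLU-style task lists lettered options in the prompt and scores only the answer letter.

```python
# Illustrative only: how the two prompt styles differ for one ARC-like question.
question = "Which gas do plants absorb from the atmosphere?"
choices = ["oxygen", "carbon dioxide", "nitrogen", "helium"]

# Default ARC-style requests: one (context, continuation) pair per answer string.
default_requests = [(f"Question: {question}\nAnswer:", f" {c}") for c in choices]

# MMLU-style requests: options appear in the prompt; continuations are letters.
letters = ["A", "B", "C", "D"]
options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
mmlu_style_requests = [(f"{question}\n{options}\nAnswer:", f" {l}") for l in letters]
```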

@SefaZeng
Author

Thanks for your reply. But the result for Qwen1.5-1.8B looks fine, with an acc_norm of 0.5901 on arc_easy. I am just wondering why Qwen2's results look like random guessing; maybe it is a prompt issue (special tokens, etc.).

@haileyschoelkopf
Contributor

Ah, I see, thanks for reporting.

We have seen some unique quirks / edge cases pop up for previous Qwen models; e.g., the tokenizer for Qwen1 does not quite follow the typical HF tokenizers interface. Would definitely recommend checking tokenization if you have the bandwidth to investigate this.

How is the measured MMLU performance of this model? Does it match their reported results, or does MMLU also seem lower than anticipated when you eval it in the Harness?
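
One quick, hypothetical way to sanity-check tokenization (the prompt strings below are made up; the Harness's real boundary handling has a few extra cases):

```python
# Check how a context/continuation pair tokenizes across the boundary,
# roughly mirroring how the Harness splits off the continuation tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
context = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
continuation = " carbon dioxide"

ctx_ids = tok.encode(context)
whole_ids = tok.encode(context + continuation)

# Tokens that would be scored for this continuation; if merges happen across
# the boundary, the scored span (and thus the loglikelihood) can be off.
print(tok.convert_ids_to_tokens(whole_ids[len(ctx_ids):]))
# Also worth checking that no unexpected special tokens are prepended/appended.
print(tok.decode(whole_ids) == context + continuation)
```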

@SefaZeng
Author

The MMLU scores are also quite low:

|      Groups       |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|------:|------|-----:|------|-----:|---|-----:|
|mmlu               |    N/A|none  |     0|acc   |0.2295|±  |0.0035|
| - humanities      |    N/A|none  |     5|acc   |0.2421|±  |0.0062|
| - other           |    N/A|none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences |    N/A|none  |     5|acc   |0.2171|±  |0.0074|
| - stem            |    N/A|none  |     5|acc   |0.2125|±  |0.0073|
