Model | Open Source | Chinese Reasoning | Chinese Language | Overall |
---|---|---|---|---|
GPT-4-1106-preview | - | 7.73 | 8.29 | 8.01 |
DeepSeek-V2-Chat(RL) | √ | 7.45 | 8.36 | 7.91 |
erniebot-4.0-202404 (文心一言) | - | 7.61 | 8.17 | 7.89 |
DeepSeek-V2-Chat(SFT) | √ | 7.30 | 8.17 | 7.74 |
GPT-4-0613 | - | 7.47 | 7.59 | 7.53 |
erniebot-4.0-202312 (文心一言) | - | 6.84 | 7.88 | 7.36 |
moonshot-v1-32k-202404 (月之暗面) | - | 6.42 | 8.02 | 7.22 |
Qwen1.5-72B-Chat (通义千问) | √ | 6.45 | 7.93 | 7.19 |
DeepSeek-67B-Chat | √ | 5.75 | 7.11 | 6.43 |
Yi-34B-Chat (零一万物) | √ | 4.86 | 7.38 | 6.12 |
GPT-3.5-turbo-0613 | - | 5.35 | 6.71 | 6.08 |
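In every row except GPT-3.5-turbo-0613, the Overall score equals the plain average of the two sub-scores (e.g., (7.73 + 8.29) / 2 ≈ 8.01 for GPT-4-1106-preview). A minimal sketch of that recomputation, assuming the plain-average convention; the GPT-3.5 row (6.08 reported vs. 6.03 computed) deviates slightly, so the original source may round or weight differently:

```python
# Recompute the Overall column, assuming it is the plain average of the
# Chinese Reasoning and Chinese Language sub-scores (an assumption; the
# GPT-3.5-turbo-0613 row deviates slightly from it).
scores = {
    "GPT-4-1106-preview": (7.73, 8.29),    # table: 8.01
    "DeepSeek-V2-Chat(RL)": (7.45, 8.36),  # table: 7.91
    "GPT-3.5-turbo-0613": (5.35, 6.71),    # table: 6.08, average gives 6.03
}

for model, (reasoning, language) in scores.items():
    print(f"{model}: {(reasoning + language) / 2:.2f}")
```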
Small Model | Open Source | Chinese Reasoning | Chinese Language | English | Code |
---|---|---|---|---|---|
Yi-1.5-9B | √ | | | | |
Yi-1.5-6B | √ | | | | |
Model | English | Chinese | Code | Math | Params | Context |
---|---|---|---|---|---|---|
DeepSeek-V2-Chat(RL) | 157.5 | 159.6 | 185.6 | 146.1 | ||
DeepSeek-V2-Chat(SFT) | 159.7 | 163.3 | 175.9 | 143.5 | ||
LLaMA3-70B-Instruct | 160.4 | 138.6 | 176.5 | 141.7 | ||
Mixtral-8x22B | 156.5 | 121.0 | 164.4 | 137.7 | 44/176 | |
QWen1.5-72B-Chat | 142.1 | 165.1 | 140.9 | 122.5 | ||
DeepSeek-V2(MoE-236B) | 157.4 | 165.7 | 115.4 | 122.8 | | 128K |
DeepSeek-V1-Chat(SFT) | 142.8 | 133.0 | 153.5 | 116.7 | ||
LLaMA3-70B | 159.9 | 136.8 | 116.8 | 125.2 | ||
Mixtral-8x7B | | | | | 13/56 | |
DeepSeek-V1(Dense-67B) | 140.0 | 136.9 | 102.5 | 82.1 | | |
DeepSeek-V2-Lite-Chat | | | | | 2.4/15.7 | 32K |
Arctic-128×3.66B(MoE-480B) | | | | | 17/480 | |
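The four aggregate columns appear to be plain sums of the per-benchmark scores in the domain tables below: English = MMLU + BBH, Chinese = C-Eval + CMMLU, Math = GSM8K + MATH, and Code = HumanEval + MBPP (plus LiveCodeBench for the chat models). Params lists activated/total parameters in billions for the MoE models (e.g., 13/56 for Mixtral-8x7B). A sketch of that aggregation under those assumptions, with benchmark values copied from the tables below:

```python
# Rebuild the aggregate columns from the per-benchmark scores listed in
# the domain tables below, assuming each aggregate is their plain sum
# (English = MMLU + BBH, Chinese = C-Eval + CMMLU, and so on).
benchmarks = {
    "DeepSeek-V2-Chat(RL)": {
        "English": [77.8, 79.7],        # MMLU, BBH
        "Chinese": [78.0, 81.6],        # C-Eval, CMMLU
        "Code":    [81.1, 72.0, 32.5],  # HumanEval, MBPP, LiveCodeBench
        "Math":    [92.2, 53.9],        # GSM8K, MATH
    },
}

for model, domains in benchmarks.items():
    totals = {name: round(sum(vals), 1) for name, vals in domains.items()}
    print(model, totals)
    # -> {'English': 157.5, 'Chinese': 159.6, 'Code': 185.6, 'Math': 146.1}
```

Note that the QWen1.5-72B-Chat Code and Math cells (140.9 and 122.5) do not match the sum of their domain rows below, so this convention may not hold for every entry.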
English Domain | MMLU | BBH | Total |
---|---|---|---|
Claude-3-Opus | 86.8 (5-shot) | 86.8 (3-shot) | |
LLaMA3-70B-Instruct | 80.3 | 80.1 | 160.4 |
LLaMA3-70B | 78.9 | 81.0 | 159.9 |
DeepSeek-V2-Chat(SFT) | 78.4 | 81.3 | 159.7 |
DeepSeek-V2-Chat(RL) | 77.8 | 79.7 | 157.5 |
DeepSeek-V2(MoE-236B) | 78.5 | 78.9 | 157.4 |
Mixtral-8x22B | 77.6 | 78.9 | 156.5 |
Mixtral-8x7B | 70.4 | ||
DeepSeek-V1 Chat(SFT) | 71.1 | 71.7 | 142.8 |
QWen1.5-72B-Chat | 76.2 | 65.9 | 142.1
Yi-1.5-34B-Chat | 76.8 | ||
Yi-1.5-9B-Chat | 69.5 | 72.4 | |
Yi-1.5-6B-Chat | 63.5 | 59.0 | |
QWen1.5-32B-Chat | 74.3 | ||
Mixtral-8x7B-Instruct-v0.1 | 71.4 | ||
Mixtral-8x22B-Instruct-v0.1 | 77.7 | ||
DeepSeek-V1(Dense-67B) | 71.3 | 68.7 | 140.0
GPT-4 | 86.4 | 86.7 | |
DeepSeek-V2-Lite-Chat | 55.7 | 48.1 | |
DeepSeekMoE-16B-Chat | 47.2 | 42.2 | |
DeepSeek-7B-Chat | 49.7 | 43.1 | |
Arctic-128×3.66B(MoE-480B) | 67.3? |
Chinese Domain | C-Eval | CMMLU | CLUEWSC |
---|---|---|---|
DeepSeek-V2 (MoE-236B) | 81.7 | 84.0 | |
QWen1.5-72B-Chat | 82.2 | 82.9 | |
DeepSeek-V2-Chat(SFT) | 80.9 | 82.4 | |
DeepSeek-V2-Chat(RL) | 78.0 | 81.6 | |
LLaMA3-70B-Instruct | 67.9 | 70.7 | |
DeepSeek-V1(Dense-67B) | 66.1 | 70.8 | |
LLaMA3-70B | 67.5 | 69.3 | |
DeepSeek-V1-Chat(SFT) | 65.2 | 67.8 | |
Mixtral-8x22B | 60.0 | 61.0 | |
GPT-4 | 69.9 | 71.0 | |
QWen-14B-Chat | 71.7 | 70.0 | |
Yi-34B-Chat | 77.71 | 73.52 | |
QWen1.5-7B-Chat | 73.4 | ||
Yi-1.5-9B | 74.8 | ||
Yi-1.5-6B | 70.8 | ||
DeepSeek-V2-Lite-Chat | 60.1 | 62.5 | 80.0 |
DeepSeekMoE-16B-Chat | 40.0 | 49.3 | 68.2 |
DeepSeek-7B-Chat | 44.7 | 51.2 | 66.2 |
Code Domain | HumanEval | MBPP | LiveCodeBench(0901-0401) | MT-Bench |
---|---|---|---|---|
Claude-3-Opus | 84.9 (0-shot) | | | |
DeepSeek-V2-Chat(RL) | 81.1 | 72.0 | 32.5 | |
LLaMA3-70B-Instruct | 76.2 | 69.8 | 30.5 |
DeepSeek-V2-Chat(SFT) | 76.8 | 70.4 | 28.7 | |
Yi-1.5-34B-Chat | 75.2 | 74.6 | | 8.5
Mixtral-8x22B | 75.0 | 64.4 | 25.0 | |
DeepSeek-V1-Chat(SFT) | 73.8 | 61.4 | 18.3 | |
QWen1.5-72B-Chat | 64.6 | 72.5 | 18.8 | 8.61 |
LLaMA3-70B | 48.2 | 68.6 | ||
DeepSeek-V2(MoE-236B) | 48.8 | 66.6 | ||
Yi-1.5-9B-Chat | 66.5 | 78.8 | | 8.2
Yi-1.5-6B-Chat | 64.0 | 70.9 | | 7.5
LLaMA3-8B-Instruct | 61.6 | 61.4 | | 8.0
DeepSeek-V1(Dense-67B) | 45.1 | 57.4 | ||
QWen1.5-32B-Chat | 51.2 | 66.9 | | 8.3
QWen1.5-14B-Chat | | | | 7.91
Mixtral-8x7B-Instruct-v0.1 | 45.1 | 59.5 | | 8.3
Mixtral-8x22B-Instruct-v0.1 | 76.2 | 73.8 | | 8.6
QWen1.5-7B-Chat | 36.0 | 46.1 | | 7.60
Yi-1.5-9B | 41.4 | 61.1 | ||
Yi-1.5-6B | 36.5 | 56.8 | ||
DeepSeek-V2-Lite-Chat | 57.3 | 45.8 | ||
DeepSeekMoE-16B-Chat | 45.7 | 46.2 | ||
DeepSeek-7B-Chat | 45.1 | 39.0 |
HumanEval | Pass@1 | Pass@10 | 0-shot | 5-shot |
---|---|---|---|---|
Claude-3-Opus | 84.9 | | | |
StarCoder2-15B | | | | |
StarCoder2-7B | | | | |
StarCoder2-3B | | | | |
LLaMA3-70B | 81.7 | | | |
LLaMA3-8B | 62.2 | | | |
Yi-Chat-34B | 7.9 | | | |
QWen-14B-Chat | 11.1 | | | |
DeepSeek-Coder-33B-Instruct | 31.7 | | | |
GPT-4-Turbo | 48.4 | | | |
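Pass@k here is the standard HumanEval metric: draw n samples per problem, count the c that pass the unit tests, and average the unbiased estimator from the Codex paper (Chen et al., 2021) over all problems. A minimal sketch, with hypothetical sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 200 generations per problem, 120 of them passing.
print(pass_at_k(200, 120, 1))   # 0.6
print(pass_at_k(200, 120, 10))  # ~0.9999
```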
Math Domain | GSM8K | MATH | CMath |
---|---|---|---|
Claude-3-Opus | 95.0 (0-shot) | 60.1 (0-shot) |
DeepSeek-V2 Chat (RL) | 92.2 | 53.9 | |
DeepSeek-V2 Chat (SFT) | 90.8 | 52.7 | |
LLaMA3-70B Instruct | 93.2 | 48.5 | |
Mixtral-8x22B | 87.9 | 49.8 | |
LLaMA3-70B | 83.0 | 42.2 | |
DeepSeek-V2 (MoE-236B) | 79.2 | 43.6 | |
QWen1.5-72B-Chat | 86.0 | 44.4 | |
DeepSeek-V1 Chat (SFT) | 84.1 | 32.6 | |
DeepSeek-V1 (Dense-67B) | 63.4 | 18.7 | |
Yi-1.5-34B-Chat | 90.2 | 50.1 | |
QWen1.5-32B-Chat | 83.9 | 43.3 | |
Mixtral-8x7B-Instruct-v0.1 | 65.7 | 28.4 | |
Mixtral-8x22B-Instruct-v0.1 | 84.0 | 41.1 | |
QWen1.5-7B-Chat | 70.1 | 20.3 | |
LLaMA3-8B | 54.7 | 21.16 |
Yi-1.5-9B | 73.7 | 32.6 | |
Yi-1.5-6B | 62.2 | 28.42 | |
DeepSeek-7B-Chat | 62.6 | 14.7 | 66.4 |
DeepSeekMoE-16B-Chat | 62.2 | 15.2 | 67.9 |
DeepSeek-V2-Lite-Chat | 72.0 | 27.9 | 71.7 |
Arctic-128×3.66B(MoE-480B) | 74.2 |