
MMLU-Chinese (Measuring Massive Multitask Language Understanding, Chinese translation)

Evaluating the capabilities of large language models has become increasingly important, and MMLU is one of the key benchmark datasets. However, MMLU only supports evaluation in English. What if you want to test a model's ability in Chinese? This project takes a pragmatic shortcut: translate MMLU's questions and answers into Chinese, so the same benchmark can be used to evaluate the Chinese ability of large language models.
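To make the idea concrete, here is a minimal sketch of turning the translated CSV files into few-shot prompts. The file layout (data/dev/*_dev.csv and data/test/*_test.csv with columns question, A, B, C, D, answer) and the prompt wording are assumptions carried over from the original MMLU repository, not necessarily this project's exact format.

```python
# Minimal sketch: build a k-shot MMLU prompt from translated CSV files.
# The paths and prompt wording below are assumptions, not this project's actual scripts.
import csv

CHOICES = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    """row = [question, A, B, C, D, answer]; render one multiple-choice item."""
    text = row[0]
    for letter, option in zip(CHOICES, row[1:5]):
        text += f"\n{letter}. {option}"
    text += "\n答案:"
    if include_answer:
        text += f" {row[5]}\n\n"
    return text

def build_prompt(dev_rows, test_row, k=5):
    """k answered dev examples followed by one unanswered test question."""
    prompt = "以下是单项选择题,请直接给出正确答案的选项。\n\n"
    for row in dev_rows[:k]:
        prompt += format_example(row, include_answer=True)
    prompt += format_example(test_row, include_answer=False)
    return prompt

def load_csv(path):
    with open(path, encoding="utf-8") as f:
        return list(csv.reader(f))

# Example usage (hypothetical paths):
# dev = load_csv("data/dev/abstract_algebra_dev.csv")
# test = load_csv("data/test/abstract_algebra_test.csv")
# print(build_prompt(dev, test[0]))
```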

⚠️ Note

  • This project is still fairly rough; the evaluation scores are for reference only.

Current progress:

  • There are 57 categories in total.
  • Few-shot examples: 57 × 5. These have all been translated.
  • Test questions: 57 × 10. Only 10 questions per category have been translated so far, while each category actually contains 100-200 questions, so only a small fraction of the test set is covered.

Model support matrix

The following models are currently supported (a rough sketch of the evaluation loop appears after the notes below):

| Model   | Command                                                                                             | Score (English prompt) | Score (Chinese prompt) |
|---------|-----------------------------------------------------------------------------------------------------|------------------------|------------------------|
| LLaMA   | `CUDA_VISIBLE_DEVICES=0 python evaluate_llama.py -m decapoda-research/llama-7b-hf -s llama_result`   | 0.28                   | 0.297                  |
| Bloomz  | `CUDA_VISIBLE_DEVICES=0 python evaluate_bloomz.py -m bigscience/bloomz-7b1-mt -s bloom_result`       | 0.345                  | 0.362                  |
| ChatGLM | `CUDA_VISIBLE_DEVICES=0 python evaluate_chatglm.py -m THUDM/chatglm-6b -s chatglm_result`            | 0.321                  | 0.310                  |
  • The LLaMA model has not yet been merged into the transformers main branch; zphang's port can be used instead.
  • LLaMA has not been instruction-tuned and its pretraining corpus contains relatively little Chinese, so its results may be weaker.
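
The sketch below illustrates the general shape of such a scoring loop with Hugging Face transformers: feed the few-shot prompt, read the next-token logits, and pick whichever of A/B/C/D scores highest. This is an assumption about how the evaluate_*.py scripts work, not their actual code.

```python
# Minimal sketch (assumption, not the repository's actual scripts): score a prompt by
# comparing the model's next-token logits for the four answer letters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_choice(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Token id of each answer letter (take the last id in case of a leading special token).
    choice_ids = [tokenizer(c, add_special_tokens=False).input_ids[-1] for c in choices]
    scores = next_token_logits[choice_ids]
    return choices[int(scores.argmax())]

# Example usage with a model name taken from the table above:
# tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")
# model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1-mt").half().cuda()
# print(predict_choice(model, tokenizer, prompt))
```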

Known issues

  • Because the translation combines machine translation with my own work, and many of the subject areas are outside my expertise, there may be errors. Corrections are welcome.

TODO

  1. Translate more questions
  2. Multi-GPU support (a possible approach is sketched below)
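
One possible route to the multi-GPU item, sketched under the assumption that Hugging Face Accelerate is acceptable here: let device_map="auto" shard the model across all visible GPUs.

```python
# Possible multi-GPU approach (assumption, not the project's plan): shard the model
# across all visible GPUs with Accelerate's device_map="auto".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-7b1-mt"  # placeholder taken from the table above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # requires `pip install accelerate`; layers are split across GPUs
    torch_dtype="auto",
)
# Inputs can then be moved to the device of the first shard before scoring.
```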

Contact

You can follow the WeChat official account below and reply "交流群" to join the discussion group.

The content below is from the upstream repository before the fork and targets the original English benchmark.

This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

This repository contains OpenAI API evaluation code, and the test is available for download here.
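
For reference, here is a minimal sketch of what an OpenAI-API-based evaluation call looks like with the current openai Python client; the upstream repository's actual code may use an older API and read answer logprobs instead.

```python
# Minimal sketch of an OpenAI-API-based evaluation call (not the repository's exact code).
# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def ask_model(prompt, model="gpt-3.5-turbo"):
    """Send a few-shot MMLU prompt and return the single-letter answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,        # we only need "A"/"B"/"C"/"D"
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```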

Test Leaderboard

If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.

Results of the test:

| Model                      | Authors                | Humanities | Social Sciences | STEM | Other | Average |
|----------------------------|------------------------|------------|-----------------|------|-------|---------|
| Chinchilla (70B, few-shot) | Hoffmann et al., 2022  | 63.6       | 79.3            | 54.9 | 73.9  | 67.5    |
| Gopher (280B, few-shot)    | Rae et al., 2021       | 56.2       | 71.9            | 47.4 | 66.1  | 60.0    |
| GPT-3 (175B, fine-tuned)   | Brown et al., 2020     | 52.5       | 63.9            | 41.4 | 57.9  | 53.9    |
| flan-T5-xl                 | Chung et al., 2022     | 46.3       | 57.7            | 39.0 | 55.1  | 49.3    |
| UnifiedQA                  | Khashabi et al., 2020  | 45.6       | 56.6            | 40.2 | 54.6  | 48.9    |
| GPT-3 (175B, few-shot)     | Brown et al., 2020     | 40.8       | 50.4            | 36.7 | 48.8  | 43.9    |
| GPT-3 (6.7B, fine-tuned)   | Brown et al., 2020     | 42.1       | 49.2            | 35.1 | 46.9  | 43.2    |
| flan-T5-large              | Chung et al., 2022     | 39.1       | 49.1            | 33.2 | 47.4  | 41.9    |
| flan-T5-base               | Chung et al., 2022     | 34.0       | 38.1            | 27.6 | 37.0  | 34.2    |
| GPT-2                      | Radford et al., 2019   | 32.8       | 33.3            | 30.2 | 33.1  | 32.4    |
| flan-T5-small              | Chung et al., 2022     | 29.9       | 30.9            | 27.5 | 29.7  | 29.5    |
| Random Baseline            | N/A                    | 25.0       | 25.0            | 25.0 | 25.0  | 25.0    |

Citation

Please cite us when using our code, data or model.

@misc{MMLU_Chinese,
  author = {Huang Chao},
  title = {Measuring Massive Multitask Language Understanding in Chinese},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/chaoswork/MMLU_Chinese}},
}

If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:

@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
