This repo contains the human evaluation results and the LLM evaluation results from our paper "Can Large Language Models Be an Alternative to Human Evaluations?" We release only the results for open-ended story generation, since it is the main experiment in the paper.
Files:
```
.
├── curie-temp-1-top_p-0.9.json
├── davinci-explain-temp-1-top_p-0.9.json
├── davinci-persona-temp-1-top_p-0.9.json
├── davinci-temp-1-top_p-0.9.json
├── human_eval_result.json
└── readme.md
```
- The human evaluation results are in `human_eval_result.json`. These are not the results in Section 3; they are the results of mixing the stories written by humans with the stories generated by GPT-2. (Please refer to Appendix C.1 for more details.) The numbers may be slightly different from those in the paper: we found that some ratings were missing, so we re-recruited the same raters to rate the stories with missing ratings. The current results are complete.
- All other `.json` files are the results of querying GPT-3.5 models in January 2023. Each file corresponds to a different ablation used in Section 3; the file name indicates which ablation it is.
- We do not release the results of ChatGPT's evaluation because we used the ChatGPT UI for LLM evaluation. (There was no ChatGPT API at the time of writing the paper.)
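The exact JSON schema of the result files is not documented in this readme, so the sketch below only loads each file and reports its top-level structure. This is a minimal starting point, not part of the released code; adapt it after inspecting the data yourself.

```python
import json
from pathlib import Path

# File names taken from the repo listing above.
RESULT_FILES = [
    "curie-temp-1-top_p-0.9.json",
    "davinci-explain-temp-1-top_p-0.9.json",
    "davinci-persona-temp-1-top_p-0.9.json",
    "davinci-temp-1-top_p-0.9.json",
    "human_eval_result.json",
]

def load_results(path):
    """Load one result file and return the parsed JSON object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    for name in RESULT_FILES:
        if not Path(name).exists():  # skip files not present locally
            continue
        data = load_results(name)
        # The schema is an unknown here, so only report the shape.
        if isinstance(data, dict):
            print(f"{name}: dict with keys {list(data)[:5]}")
        else:
            print(f"{name}: {type(data).__name__} with {len(data)} entries")
```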
The data in this repo (the ratings by LLMs and humans) are released under the Apache License 2.0.
If you use the results from our paper, please cite it. We would also appreciate a citation if you use LLMs for evaluation.
```bibtex
@inproceedings{chiang-lee-2023-large,
    title = "Can Large Language Models Be an Alternative to Human Evaluations?",
    author = "Chiang, Cheng-Han and
      Lee, Hung-yi",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.870",
    pages = "15607--15631",
    abstract = "Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.",
}
```