G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Liu, Yang; Iter, Dan; Xu, Yichong; Wang, Shuohang; Xu, Ruochen; Zhu, Chenguang

arXiv:2303.16634 (cs)

[Submitted on 29 Mar 2023 (v1), last revised 23 May 2023 (this version, v3)]

Abstract: The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to correlate relatively poorly with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which has the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still correspond less closely to human judgments than medium-size neural evaluators. In this work, we present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin. We also present a preliminary analysis of the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators being biased towards LLM-generated texts. The code is at this https URL
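The headline result in the abstract is a Spearman rank correlation between evaluator scores and human ratings. As a minimal sketch of how such a figure is computed, the snippet below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the toy scores are illustrative assumptions, not taken from the paper or its released code, and the numbers are not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: evaluator scores and human ratings for five summaries.
evaluator_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
human_ratings = [5, 3, 2, 3, 1]
rho = spearman(evaluator_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

Because only the ordering of scores matters, an evaluator can score on a different scale from the human annotators and still reach a high correlation, which is why rank correlation is a common choice for comparing automatic metrics against human judgments.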

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2303.16634 [cs.CL]
	(or arXiv:2303.16634v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.16634

Submission history

From: Yang Liu
[v1] Wed, 29 Mar 2023 12:46:54 UTC (242 KB)
[v2] Thu, 6 Apr 2023 23:49:08 UTC (209 KB)
[v3] Tue, 23 May 2023 22:12:16 UTC (209 KB)
