Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Thakur, Aman Singh; Choudhary, Kartik; Ramayapally, Venkat Srinik; Vaidyanathan, Sankaran; Hupkes, Dieuwke

arXiv:2406.12624 (cs)

[Submitted on 18 Jun 2024 (v1), last revised 2 Nov 2024 (this version, v4)]

Abstract: Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models' - both base and instruction-tuned - we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement, and their assigned scores may still differ by up to 5 points from human-assigned scores. In terms of their ranking of the nine exam-taker models, however, smaller models and even the lexical metric 'contains' may also provide a reasonable signal. Through error analysis and other studies, we identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency. The fact that even the best judges differ from humans in this comparatively simple setup suggests that caution may be wise when using judges in more complex setups. Lastly, our research rediscovers the importance of using alignment metrics beyond simple percent alignment, showing that judges with high percent agreement can still assign vastly different scores.
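The abstract's final point, that high percent agreement can hide large score differences, can be illustrated with a small sketch. The abstract does not name the alternative alignment metric, so Cohen's kappa is used here only as one common chance-corrected choice. The verdict lists and function names are hypothetical and are not data or code from the paper.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always use one label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = answer marked correct, 0 = incorrect).
# The human marks 90 of 100 answers correct; a maximally lenient
# judge marks every answer correct.
human = [1] * 90 + [0] * 10
lenient_judge = [1] * 100

agreement = percent_agreement(human, lenient_judge)
kappa = cohens_kappa(human, lenient_judge)
human_score = 100 * sum(human) / len(human)
judge_score = 100 * sum(lenient_judge) / len(lenient_judge)

print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(f"score gap:         {judge_score - human_score:.0f} points")
```

In this toy setup the lenient judge agrees with the human on 90% of items, yet its kappa is zero and the score it assigns is 10 points higher, which is the kind of gap that percent agreement alone does not reveal.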

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.12624 [cs.CL]
	(or arXiv:2406.12624v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.12624

Submission history

From: Aman Singh Thakur
[v1] Tue, 18 Jun 2024 13:49:54 UTC (4,482 KB)
[v2] Tue, 2 Jul 2024 02:57:01 UTC (4,482 KB)
[v3] Fri, 11 Oct 2024 05:29:18 UTC (3,083 KB)
[v4] Sat, 2 Nov 2024 17:07:06 UTC (3,083 KB)
