Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

Oh, Byung-Doh; Schuler, William

arXiv:2304.11389 (cs)

[Submitted on 22 Apr 2023 (v1), last revised 22 Oct 2023 (this version, v2)]

Abstract: Recent psycholinguistic studies have drawn conflicting conclusions about the relationship between the quality of a language model and the ability of its surprisal estimates to predict human reading times, which has been speculated to be due to the large gap in both the amount of training data and model capacity across studies. The current work aims to consolidate these findings by evaluating surprisal estimates from Transformer-based language model variants that vary systematically in the amount of training data and model capacity on their ability to predict human reading times. The results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens, after which they begin to diverge from humanlike expectations. Additionally, newly-trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times. These results suggest that the massive amount of training data is mainly responsible for the poorer fit achieved by surprisal from larger pre-trained language models, and that a certain degree of model capacity is necessary for Transformer-based language models to capture humanlike expectations.
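For readers unfamiliar with the two quantities the abstract relates, the following is a minimal sketch of their standard definitions, not the authors' code or evaluation pipeline. Surprisal of a token is taken as the negative log probability the model assigns to it given its context, and perplexity as the exponentiated mean surprisal. The function names and the probability values are hypothetical and chosen only for illustration.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of one token in bits: -log2 P(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def perplexity(probs: list[float]) -> float:
    """Perplexity as 2 raised to the mean per-token surprisal in bits."""
    mean_bits = sum(surprisal_bits(p) for p in probs) / len(probs)
    return 2.0 ** mean_bits


# Hypothetical conditional probabilities for a four-token sequence
# (illustrative values, not taken from the paper).
token_probs = [0.5, 0.25, 0.125, 0.125]

# Per-token surprisal is the predictor typically regressed against
# reading times; perplexity summarizes model quality over the sequence.
surprisals = [surprisal_bits(p) for p in token_probs]
print(surprisals)
print(perplexity(token_probs))
```

Lower perplexity means the model assigns higher probability to the text overall. The abstract's finding is that, past a certain point, lower perplexity stops translating into per-token surprisal values that fit human reading times better.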

Comments:	Findings of the Association for Computational Linguistics: EMNLP 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2304.11389 [cs.CL]
	(or arXiv:2304.11389v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.11389

Submission history

From: Byung-Doh Oh
[v1] Sat, 22 Apr 2023 12:50:49 UTC (1,585 KB)
[v2] Sun, 22 Oct 2023 20:03:54 UTC (2,199 KB)
