Language Modeling Is Compression

Delétang, Grégoire; Ruoss, Anian; Duquenne, Paul-Ambroise; Catt, Elliot; Genewein, Tim; Mattern, Christopher; Grau-Moya, Jordi; Wenliang, Li Kevin; Aitchison, Matthew; Orseau, Laurent; Hutter, Marcus; Veness, Joel

arXiv:2309.10668 (cs)

[Submitted on 19 Sep 2023 (v1), last revised 18 Mar 2024 (this version, v2)]

Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
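The prediction-compression equivalence mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the adaptive byte-frequency model, the function names, and the scoring scheme are all assumptions chosen for brevity. The first function treats a predictor as a compressor by summing -log2 p(byte | past), which is the code length an arithmetic coder would approach. The second goes the other way and scores candidate continuations by how short gzip makes the extended sequence.

```python
import gzip
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Predictor -> compressor: total -log2 p(byte | past) in bits.

    Uses a toy adaptive order-0 model with one pseudo-count per byte
    value (an assumption for illustration, not a language model).
    """
    counts = [1] * 256  # Laplace smoothing: every byte starts with count 1
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)  # cost of coding this byte
        counts[b] += 1  # update the model only after coding the byte
        total += 1
    return bits


def compression_rate(data: bytes) -> float:
    """Ideal compressed size divided by raw size (lower is better)."""
    return ideal_code_length_bits(data) / (8 * len(data))


def gzip_length_scores(context: bytes, candidates: bytes) -> dict:
    """Compressor -> predictor: compressed length of context + candidate.

    A shorter length means gzip finds that continuation more predictable.
    mtime=0 keeps the gzip header deterministic.
    """
    return {
        c: len(gzip.compress(context + bytes([c]), mtime=0))
        for c in candidates
    }


if __name__ == "__main__":
    sample = b"abracadabra " * 50
    print(f"ideal rate under the toy model: {compression_rate(sample):.3f}")
    print(gzip_length_scores(sample, b"abz"))
```

Because gzip output sizes are whole bytes, many candidates can tie, so a one-byte lookahead like this gives only a coarse ranking. A stronger predictor plugged into the first function would yield a shorter code length on the same data, which is the sense in which better prediction means better compression.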

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Cite as:	arXiv:2309.10668 [cs.LG]
	(or arXiv:2309.10668v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2309.10668

Submission history

From: Anian Ruoss
[v1] Tue, 19 Sep 2023 14:50:38 UTC (2,092 KB)
[v2] Mon, 18 Mar 2024 23:15:47 UTC (2,356 KB)
