Unified Scaling Laws for Routed Language Models

Clark, Aidan; Casas, Diego de las; Guy, Aurelia; Mensch, Arthur; Paganini, Michela; Hoffmann, Jordan; Damoc, Bogdan; Hechtman, Blake; Cai, Trevor; Borgeaud, Sebastian; Driessche, George van den; Rutherford, Eliza; Hennigan, Tom; Johnson, Matthew; Millican, Katie; Cassirer, Albin; Jones, Chris; Buchatskaya, Elena; Budden, David; Sifre, Laurent; Osindero, Simon; Vinyals, Oriol; Rae, Jack; Elsen, Erich; Kavukcuoglu, Koray; Simonyan, Karen

arXiv:2202.01169 (cs)

[Submitted on 2 Feb 2022 (v1), last revised 9 Feb 2022 (this version, v2)]

Abstract: The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
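
The abstract's opening claim, that performance can be modeled as a power law in parameter count, can be illustrated with a minimal sketch. This is not the paper's scaling law, which is defined on two variables (parameter count and computational requirement) and is not stated on this page. The single-variable form loss = a * N^(-alpha), the helper name fit_power_law, and every number below are assumptions made up for illustration.

```python
import math


def fit_power_law(param_counts, losses):
    """Fit loss ~= a * N**(-alpha) by least squares in log-log space.

    A power law is a straight line after taking logs:
        log(loss) = log(a) - alpha * log(N)
    so an ordinary linear regression recovers both constants.
    """
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noiseless data generated from assumed (hypothetical) constants.
# These are NOT values reported in the paper.
true_a, true_alpha = 20.0, 0.08
param_counts = [1e7, 1e8, 1e9, 1e10]
losses = [true_a * n ** (-true_alpha) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, losses)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the assumed constants. Real measurements would be noisy, and for routed models a one-variable fit of this kind would ignore the second axis the abstract describes.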

Comments:	Fixing typos and affiliation clarity
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2202.01169 [cs.CL]
	(or arXiv:2202.01169v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2202.01169

Submission history

From: Aidan Clark
[v1] Wed, 2 Feb 2022 17:58:52 UTC (974 KB)
[v2] Wed, 9 Feb 2022 11:07:21 UTC (974 KB)
