Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, Ofir; Smith, Noah A.; Lewis, Mike


arXiv:2108.12409 (cs)

[Submitted on 27 Aug 2021 (v1), last revised 22 Apr 2022 (this version, v2)]


Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
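
The biasing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`alibi_bias`, `biased_attention_weights`), the single-head setup, the causal masking, and the slope value used in the demo are all assumptions made for illustration. The only part taken from the abstract is that each query-key score receives a penalty proportional to the distance between the two positions, and that no positional embeddings are added to the inputs.

```python
import math


def alibi_bias(seq_len, slope):
    """Build a seq_len x seq_len matrix of linear distance penalties.

    Entry [i][j] is -slope * (i - j) for keys at or before the query
    (j <= i). Future keys (j > i) get -inf, which assumes a causal
    language model; that masking is an assumption of this sketch.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, slope):
    """Add the distance penalty to raw query-key scores, then normalize.

    `scores` is a square list of lists holding raw query-key dot products
    for a single attention head (hypothetical input for this sketch).
    """
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([s + b for s, b in zip(scores[i], bias[i])])
        for i in range(n)
    ]


if __name__ == "__main__":
    # With all raw scores equal, the weights depend only on the penalty,
    # so each query attends most strongly to the nearest keys.
    scores = [[0.0] * 4 for _ in range(4)]
    for row in biased_attention_weights(scores, slope=0.5):
        print([round(w, 3) for w in row])
```

Because the penalty depends only on the relative distance `i - j`, the same function can be called with a `seq_len` larger than any length seen in training, which is the property the abstract refers to as extrapolation. The slope of 0.5 is an arbitrary illustrative value.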

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2108.12409 [cs.CL]
	(or arXiv:2108.12409v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2108.12409

Submission history

From: Ofir Press
[v1] Fri, 27 Aug 2021 17:35:06 UTC (187 KB)
[v2] Fri, 22 Apr 2022 18:20:48 UTC (221 KB)
