Masked Structural Growth for 2x Faster Language Model Pre-training

Yao, Yiqun; Zhang, Zheng; Li, Jing; Wang, Yequan

arXiv:2305.02869 (cs)

[Submitted on 4 May 2023 (v1), last revised 6 Apr 2024 (this version, v3)]

Abstract: Accelerating large language model pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements on training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at this https URL.
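
To make the idea of a strictly function-preserving growth operator concrete, the following is a minimal sketch of one plausible reading of "masked" growth, applied to the hidden width of a toy two-layer MLP. It is an assumption for illustration, not the authors' implementation: the helper names (`mlp`, `grow_hidden`), the multiplicative 0/1 mask on hidden units, and the toy dimensions are all hypothetical. The point it shows is that if new units are masked to zero, the grown network computes the same output as the small one regardless of how the new weights are initialized.

```python
import random


def mlp(x, w1, w2, mask):
    """Toy two-layer MLP: y = w2 @ (mask * relu(w1 @ x)).

    `mask` holds one multiplier per hidden unit (hypothetical mechanism).
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    hidden = [m * h for m, h in zip(mask, hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def grow_hidden(w1, w2, mask, extra, rng):
    """Append `extra` hidden units with arbitrary random weights.

    The new units get mask value 0.0, so they contribute nothing to the
    output at the moment of growth, whatever their initial weights are.
    """
    d_in = len(w1[0])
    new_w1 = w1 + [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(extra)]
    new_w2 = [row + [rng.gauss(0, 1) for _ in range(extra)] for row in w2]
    new_mask = mask + [0.0] * extra
    return new_w1, new_w2, new_mask


rng = random.Random(0)
d_in, d_hidden, d_out = 3, 4, 2  # toy sizes, chosen only for illustration

# Small model with all existing hidden units fully active (mask = 1).
w1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
w2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
mask = [1.0] * d_hidden

x = [0.5, -1.0, 2.0]
y_small = mlp(x, w1, w2, mask)

# Double the hidden width; the new weights are random, not copied or zeroed.
g_w1, g_w2, g_mask = grow_hidden(w1, w2, mask, extra=4, rng=rng)
y_grown = mlp(x, g_w1, g_w2, g_mask)

# Largest absolute difference between the outputs before and after growth.
max_diff = max(abs(a - b) for a, b in zip(y_small, y_grown))
print("max output difference after growth:", max_diff)
```

In this sketch the preservation comes entirely from the mask, which is why the initialization of the new weights does not matter. Presumably the mask values for new units would then be raised toward 1 during training so that the added capacity is used; the abstract does not describe that schedule, so it is left out here.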

Comments:	ICLR 2024 camera ready
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.02869 [cs.CL]
	(or arXiv:2305.02869v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.02869

Submission history

From: Yiqun Yao
[v1] Thu, 4 May 2023 14:28:39 UTC (316 KB)
[v2] Fri, 8 Mar 2024 08:54:08 UTC (555 KB)
[v3] Sat, 6 Apr 2024 06:18:26 UTC (555 KB)
