Efficient Continual Pre-training for Building Domain Specific Large Language Models

Xie, Yong; Aggarwal, Karan; Ahmad, Aitzaz

arXiv:2311.08545 (cs)

[Submitted on 14 Nov 2023]

Abstract: Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work proposes a cost-effective alternative to building domain-specific LLMs from scratch.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.08545 [cs.CL]
	(or arXiv:2311.08545v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.08545

Submission history

From: Karan Aggarwal
[v1] Tue, 14 Nov 2023 21:19:14 UTC (487 KB)
