Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Chen, Hao; Zhang, Yiming; Zhang, Qi; Yang, Hantao; Hu, Xiaomeng; Ma, Xuetao; Yanggong, Yifan; Zhao, Junbo

Computer Science > Artificial Intelligence

arXiv:2305.09246 (cs)

[Submitted on 16 May 2023]

Title:Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Authors:Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, Junbo Zhao

View PDF

Abstract:Instruction tuning for large language models (LLMs) has gained attention from researchers due to its ability to unlock the potential of LLMs in following instructions. While instruction tuning offers advantages for facilitating the adaptation of large language models (LLMs) to downstream tasks as a fine-tuning approach, training models with tens of millions or even billions of parameters on large amounts of data results in unaffordable computational costs. To address this, we focus on reducing the data used in LLM instruction tuning to decrease training costs and improve data efficiency, dubbed as Low Training Data Instruction Tuning (LTD Instruction Tuning). Specifically, this paper conducts a preliminary exploration into reducing the data used in LLM training and identifies several observations regarding task specialization for LLM training, such as the optimization of performance for a specific task, the number of instruction types required for instruction tuning, and the amount of data required for task-specific models. The results suggest that task-specific models can be trained using less than 0.5% of the original dataset, with a 2% improvement in performance over those trained on full task-related data.

Comments:	9 pages, 2 tables, 2 figures
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2305.09246 [cs.AI]
	(or arXiv:2305.09246v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2305.09246

Submission history

From: Hao Chen [view email]
[v1] Tue, 16 May 2023 07:52:57 UTC (249 KB)

Computer Science > Artificial Intelligence

Title:Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Artificial Intelligence

Title:Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators