Collection of training data management explorations for large language models

Data Management for LLM

A curated list of resources on training data management for large language models.

Pretraining



Data Quantity

  • Scaling Laws

    • Scaling Laws for Neural Language Models (Arxiv, Jan. 2020) [Paper]
    • An empirical analysis of compute-optimal large language model training (NeurIPS 2022) [Paper]
  • Data Repetition

    • Scaling Laws and Interpretability of Learning from Repeated Data (Arxiv, May 2022) [Paper]
    • Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning (Arxiv, Oct. 2022) [Paper]
    • Scaling Data-Constrained Language Models (Arxiv, May 2023) [Paper] [Code]
    • To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis (Arxiv, May 2023) [Paper]
    • D4: Improving LLM Pretraining via Document De-Duplication and Diversification (Arxiv, Aug. 2023) [Paper]

Data Quality

  • Deduplication

    • Deduplicating training data makes language models better (ACL 2022) [Paper] [Code]
    • Deduplicating training data mitigates privacy risks in language models (ICML 2022) [Paper]
    • Noise-Robust De-Duplication at Scale (ICLR 2023) [Paper]
    • SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Arxiv, Mar. 2023) [Paper] [Code]
  • Quality Filtering

    • An Empirical Exploration in Quality Filtering of Text Data (Arxiv, Sep. 2021) [Paper]
    • Quality at a glance: An audit of web-crawled multilingual datasets (ACL 2022) [Paper]
    • The MiniPile Challenge for Data-Efficient Language Models (Arxiv, Apr. 2023) [Paper] [Dataset]
    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
    • Textbooks Are All You Need (Arxiv, Jun. 2023) [Paper] [Code]
    • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (NeurIPS 2023 Datasets and Benchmarks Track) [Paper] [Dataset]
    • Textbooks Are All You Need II: phi-1.5 technical report (Arxiv, Sep. 2023) [Paper] [Model]
    • When less is more: Investigating Data Pruning for Pretraining LLMs at Scale (Arxiv, Sep. 2023) [Paper]
    • Ziya2: Data-centric Learning is All LLMs Need (Arxiv, Nov. 2023) [Paper] [Model]
    • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (Arxiv, Jan. 2024) [Paper]
  • Toxicity Filtering

    • Detoxifying language models risks marginalizing minority voices (NAACL-HLT, 2021) [Paper] [Code]
    • Challenges in detoxifying language models (EMNLP Findings, 2021) [Paper]
    • What’s in the box? A preliminary analysis of undesirable content in the Common Crawl corpus (Arxiv, May 2021) [Paper] [Code]
    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
  • Social Biases

    • Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus (EMNLP 2021) [Paper]
    • An empirical survey of the effectiveness of debiasing techniques for pre-trained language models (ACL, 2022) [Paper] [Code]
    • Whose language counts as high quality? Measuring language ideologies in text data selection (EMNLP, 2022) [Paper] [Code]
    • From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (ACL 2023) [Paper] [Code]
  • Diversity & Age

    • Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Arxiv, Jun. 2023) [Paper]
    • D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (Arxiv, Oct. 2023) [Paper] [Code]
    • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]

Domain Composition

  • LaMDA: Language models for dialog applications (Arxiv, Jan. 2022) [Paper] [Code]
  • Data Selection for Language Models via Importance Resampling (Arxiv, Feb. 2023) [Paper] [Code]
  • CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (ICLR 2023) [Paper] [Model]
  • DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Arxiv, May 2023) [Paper] [Code]
  • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
  • SlimPajama-DC: Understanding Data Combinations for LLM Training (Arxiv, Sep. 2023) [Paper] [Model] [Dataset]
  • DoGE: Domain Reweighting with Generalization Estimation (Arxiv, Oct. 2023) [Paper]

Data Management Systems

  • Data-Juicer: A One-Stop Data Processing System for Large Language Models (Arxiv, Sep. 2023) [Paper] [Code]
  • Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (Arxiv, Nov. 2023) [Paper] [Code]

Supervised Fine-Tuning

Data Quantity

  • Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases (Arxiv, Mar. 2023) [Paper]
  • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
  • Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
  • How Abilities In Large Language Models Are Affected By Supervised Fine-Tuning Data Composition (Arxiv, Oct. 2023) [Paper]
  • Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]

Data Quality

  • Instruction Quality

    • Self-Refine: Iterative refinement with self-feedback (Arxiv, Mar. 2023) [Paper] [Project]
    • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
    • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
    • SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation (Blog post, May 2023) [Project]
    • INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Arxiv, Jun. 2023) [Paper] [Code]
    • Instruction mining: High-quality instruction data selection for large language models (Arxiv, Jul. 2023) [Paper] [Code]
    • Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models (Arxiv, Aug. 2023) [Paper]
    • Self-Alignment with Instruction Backtranslation (Arxiv, Aug. 2023) [Paper]
    • SELF: Language-Driven Self-Evolution for Large Language Models (Arxiv, Oct. 2023) [Paper]
    • Reflection-Tuning: Recycling Data for Better Instruction-Tuning (NeurIPS 2023 Instruction Workshop) [Paper] [Code]
    • Automatic Instruction Optimization for Open-source LLM Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
  • Instruction Diversity

    • Self-Instruct: Aligning language models with self-generated instructions (ACL 2023) [Paper] [Code]
    • Stanford Alpaca (Mar. 2023) [Code]
    • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
    • LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
    • #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
    • Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration (Arxiv, Oct. 2023) [Paper] [Code]
    • DiffTune: A Diffusion-Based Approach to Diverse Instruction-Tuning Data Generation (NeurIPS 2023) [Paper]
    • Data Diversity Matters for Robust Instruction Tuning (Arxiv, Nov. 2023) [Paper]
  • Instruction Complexity

    • WizardLM: Empowering Large Language Models to Follow Complex Instructions (Arxiv, Apr. 2023) [Paper] [Code]
    • WizardCoder: Empowering Code Large Language Models with Evol-Instruct (Arxiv, Jun. 2023) [Paper] [Code]
    • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Arxiv, Jun. 2023) [Paper] [Code]
    • A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment (Arxiv, Aug. 2023) [Paper]
    • #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
    • Can Large Language Models Understand Real-World Complex Instructions? (Arxiv, Sep. 2023) [Paper] [Benchmark]
    • FollowBench: A multi-level fine-grained constraints following benchmark for large language models (Arxiv, Oct. 2023) [Paper] [Code]
  • Prompt Design

    • Reframing instructional prompts to GPTk’s language (ACL Findings, 2022) [Paper] [Code]
    • Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts (NAACL, 2022) [Paper] [Code]
    • Demystifying Prompts in Language Models via Perplexity Estimation (Arxiv, Dec. 2022) [Paper]
    • Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning (ACL, 2023) [Paper] [Code]
    • Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (ACL, 2023) [Paper]
    • The False Promise of Imitating Proprietary LLMs (Arxiv, May 2023) [Paper]
    • Exploring Format Consistency for Instruction Tuning (Arxiv, Jul. 2023) [Paper]
    • Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning (Arxiv, Oct. 2023) [Paper]
    • Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]

Task Composition

  • Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks (EMNLP 2022) [Paper] [Dataset]
  • Finetuned Language Models Are Zero-Shot Learners (ICLR 2022) [Paper] [Dataset]
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR 2022) [Paper] [Code]
  • Scaling Instruction-Finetuned Language Models (Arxiv, Oct. 2022) [Paper] [Dataset]
  • OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization (Arxiv, Dec. 2022) [Paper] [Model]
  • The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (ICML, 2023) [Paper] [Dataset]
  • Exploring the Benefits of Training Expert Language Models over Instruction Tuning (ICML, 2023) [Paper] [Code]
  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (Arxiv, Jun. 2023) [Paper] [Code]
  • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]

Data-Efficient Learning

  • Data Quantity
    • Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning (Arxiv, Jul. 2023) [Paper]
    • From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning (Arxiv, Aug. 2023) [Paper] [Code]
    • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
  • Data Quality
    • NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks (SustaiNLP, 2023) [Paper]
    • Instruction Mining: High-Quality Instruction Data Selection for Large Language Models (Arxiv, Jul. 2023) [Paper] [Code]
    • AlpaGasus: Training A Better Alpaca with Fewer Data (Arxiv, Jul. 2023) [Paper]
    • OpenChat: Advancing Open-source Language Models with Mixed-Quality Data (Arxiv, Sep. 2023) [Paper] [Code]
    • Tuna: Instruction Tuning using Feedback from Large Language Models (EMNLP 2023) [Paper] [Code]
    • Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
    • Data Diversity Matters for Robust Instruction Tuning (Arxiv, Nov. 2023) [Paper]
    • MoDS: Model-oriented Data Selection for Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
    • WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation (Arxiv, Dec. 2023) [Paper]
    • One Shot Learning as Instruction Data Prospector for Large Language Models (Arxiv, Dec. 2023) [Paper]
    • Rethinking the Instruction Quality: LIFT is What You Need (Arxiv, Dec. 2023) [Paper]
    • MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following (Arxiv, Dec. 2023) [Paper]
    • An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models (Arxiv, Jan. 2024) [Paper]
    • Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning (Arxiv, Feb. 2024) [Paper] [Code]
  • Task Composition
    • Data-Efficient Finetuning Using Cross-Task Nearest Neighbors (ACL Findings, 2023) [Paper] [Code]
    • Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (Arxiv, May 2023) [Paper] [Code]
    • MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (Arxiv, Sep. 2023) [Paper] [Code]
    • Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks (EMNLP 2023) [Paper] [Code]
  • Others
    • Data-Juicer: A One-Stop Data Processing System for Large Language Models (Arxiv, Sep. 2023) [Paper] [Code]
    • LoBaSS: Gauging Learnability in Supervised Fine-tuning Data (Arxiv, Oct. 2023) [Paper]
    • Contrastive post-training large language models on data curriculum (Arxiv, Oct. 2023) [Paper]
    • IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models (ICLR 2024) [Paper] [Code]
    • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (Arxiv, Dec. 2023) [Paper] [Code]
    • LESS: Selecting Influential Data for Targeted Instruction Tuning (Arxiv, Feb. 2024) [Paper] [Code]

Useful Resources