
A quick guide (especially) for trending instruction finetuning datasets

Mars-Wei/LLMDataHub

 
 


📔 LLMDataHub: Awesome Datasets for LLM Training

Introduction 📄

Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. With the recent emergence of open-source large model frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs in small organizations or as an individual has become an important interest of the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. Besides the model frameworks themselves, large-scale, high-quality training corpora are essential for training LLMs. Relevant open-source corpora are currently scattered across the community. The goal of this repository is therefore to continuously collect high-quality training corpora for LLMs from the open-source community.

Training a chatbot LLM that follows human instructions effectively requires high-quality datasets that cover a range of conversation domains and styles. This repository provides a curated collection of datasets for chatbot training, listing the link, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.

Contact 📬

If you want to contribute, you can contact:

Junhao Zhao 📧
Advised by Prof. Wanyun Cui

General Open Access Datasets for Alignment 🟢:

Type Tags 🏷️:

  • SFT: Supervised Finetune
    • Dialog: Each entry contains continuous conversations
    • Pairs: Each entry is an input-output pair
    • Context: Each entry has a context text and related QA pairs
  • PT: Pretraining
  • CoT: Chain-of-Thought Finetuning
  • RLHF: Reward model training for Reinforcement Learning from Human Feedback
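
The entry shapes behind these tags can be sketched as small Python types. This is only an illustration: the class and field names (`PairEntry`, `instruction`, `turns`, `chosen`, and so on) are hypothetical and do not correspond to the schema of any particular dataset listed here, and `dialog_to_pairs` shows just one possible way to flatten a Dialog entry into Pairs entries.

```python
from dataclasses import dataclass, field


@dataclass
class PairEntry:
    """Pairs: one input-output pair (field names are hypothetical)."""
    instruction: str
    output: str


@dataclass
class DialogEntry:
    """Dialog: a continuous conversation, alternating user/assistant, user first."""
    turns: list[str]


@dataclass
class ContextEntry:
    """Context: a context text plus related QA pairs."""
    context: str
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class ComparisonEntry:
    """RLHF: a prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def dialog_to_pairs(dialog: DialogEntry) -> list[PairEntry]:
    """Flatten a dialog into pairs; each input carries the prior history."""
    pairs = []
    # Assistant replies sit at odd indices (1, 3, 5, ...).
    for i in range(1, len(dialog.turns), 2):
        history = "\n".join(dialog.turns[:i])
        pairs.append(PairEntry(instruction=history, output=dialog.turns[i]))
    return pairs


example = DialogEntry(
    turns=["Hi", "Hello! How can I help?", "Name a fruit.", "Apple."]
)
pairs = dialog_to_pairs(example)
```

Real datasets differ in how they name fields and mark speakers, so a loader for a specific dataset would need its own mapping into shapes like these.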
| Dataset name 📖 🔗 | Used by 🤖 | Type 🏷️ | Language 🌐 | Size 📏 | Description 🗒️ |
| --- | --- | --- | --- | --- | --- |
| ultraChat | / | Dialog | English | 1.57M dialogs | A large-scale dialog dataset created with two ChatGPT instances, one acting as the user and the other generating the responses. |
| ShareGPT_Vicuna_unfiltered | Vicuna | Pairs | Multilingual | 53K entries | Cleaned ShareGPT dataset. |
| pku-saferlhf-dataset | Beaver | RLHF | English | 10K + 1M | The first dataset of its kind; contains 10K instances with safety preferences. |
| RefGPT-Dataset | RefGPT | Pairs, Dialog | Chinese | ~50K entries | A Chinese dialog dataset aimed at improving the factual correctness of LLMs (mitigating hallucination). |
| Luotuo-QA-A-CoQA-Chinese | Luotuo project | Context | Chinese | 127K QA pairs | A dataset built on translated CoQA and augmented using the OpenAI API. |
| Wizard-LM-Chinese-instruct-evol | Luotuo project | Pairs | Chinese | ~70K entries | Chinese version of WizardLM 70K. Answers were obtained by feeding the translated questions to OpenAI's GPT API. |
| alpaca_chinese_dataset | / | Pairs | Chinese | / | GPT-4-translated Alpaca data with some complementary data (such as Chinese poetry, application, etc.). Inspected by humans. |
| Zhihu-KOL | Open Assistant | Pairs | Chinese | 1.5GB | QA data from Zhihu, a well-known Chinese QA platform. |
| Alpaca-GPT-4_zh-cn | / | Pairs | Chinese | about 50K entries | A Chinese Alpaca-style dataset, generated by GPT-4 originally in Chinese, not translated. |
| hh-rlhf (on Huggingface) | Koala | RLHF | English | 161K pairs, 79.3MB | A pairwise dataset for training reward models in reinforcement learning to improve language models' harmlessness and helpfulness. |
| Panther-dataset_v1 | Panther | Pairs | English | 377 entries | A dataset derived from hh-rlhf, rewritten into the form of input-output pairs. |
| Baize Dataset | Baize | Dialog | English | 100K dialogs | A dialog dataset generated by GPT-4 using self-talking. Questions and topics are collected from Quora, StackOverflow, and some medical knowledge sources. |
| h2oai/h2ogpt-fortune2000-personalized | h2ogpt | Pairs | English | 11363 entries | An instruction finetuning dataset developed by h2oai, covering various topics. |
| SHP | StableVicuna, chat-opt, SteamSHP | RLHF | English | 385K entries | An RLHF dataset that, unlike the ones listed above, uses scores and timestamps to infer users' preferences. Covers 18 domains; collected by Stanford. |
| ELI5 | MiniLM series | FT, RLHF | English | 270K entries | Questions and answers collected from Reddit, including scores. Might be used for RLHF reward model training. |
| evol_instruct_70k | WizardLM | Pairs | English |  | An instruction finetuning dataset derived from Alpaca-52K, using the evolution method in this paper. |
| MOSS SFT data | MOSS | Pairs, Dialog | Chinese, English | 1.1M entries | A conversational dataset collected and developed by the MOSS team. Every entry has usefulness, loyalty, and harmlessness labels. |
| ShareGPT52K | Koala, Stable LLM | Pairs | Multilingual | 52K | Conversations collected from ShareGPT, with a specific focus on customized creative conversation. |
| GPT-4all Dataset | GPT-4all | Pairs | English, might have a translated version | 400K entries | A combination of some subsets of OIG, P3, and Stackoverflow. Covers topics like general QA and customized creative questions. |
| COIG | / | Pairs | Chinese, code | 200K entries | A Chinese-based dataset covering domains such as general-purpose QA, Chinese exams, and code. Its quality is checked by human annotators. |
| RedPajama-Data-1T | RedPajama | PT | Primarily English | 1.2T tokens, 5TB | A fully open pretraining dataset that follows LLaMA's method. |
| OpenAssistant Conversations Dataset (OASST1) | OpenAssistant | Pairs, Dialog | Multilingual (English, Spanish, etc.) | 66,497 conversation trees | A large, human-written, human-annotated, high-quality conversation dataset. It aims to make LLMs generate more natural responses. |
| Alpaca-COT | Phoenix | Pairs, Dialog, CoT | English | / | A mixture of many datasets, such as the classic Alpaca dataset, OIG, Guanaco, and some CoT (Chain-of-Thought) datasets like FLAN-CoT. May be handy to use. |
| CBook-150K | / | PT, building dataset | Chinese | 150K+ books | A raw Chinese books dataset. Needs a preprocessing pipeline. |
| databricks-dolly-15k (a possible zh-cn version) | Dolly2.0 | Pairs | English | 15K+ entries | A dataset of human-written prompts and responses, featuring tasks such as open-domain question answering, brainstorming, summarization, and more. |
| AlpacaDataCleaned | Some Alpaca/LLaMA-like models | Pairs | English | / | Cleaned version of Alpaca, GPT_LLM, and GPTeacher. |
| GPT-4-LLM Dataset | Some Alpaca-like models | Pairs, RLHF | English, Chinese | 52K entries each for English and Chinese; 9K entries unnatural-instruction | NOT the dataset used by GPT-4!! It is generated by GPT-4 and some other LLMs for better Pairs and RLHF. It includes instruction data as well as comparison data in RLHF style. |
| GPTeacher | / | Pairs | English | 20K entries | A dataset with targets generated by GPT-4. It includes many of the same seed tasks as the Alpaca dataset, plus some new tasks such as roleplay. |
| HC3 | Koala | RLHF | English, Chinese | 24322 English, 12853 Chinese | A multi-domain, human-vs-ChatGPT comparison dataset. Can be used for reward model training or ChatGPT detector training. |
| Alpaca data (Download) | Alpaca, ChatGLM-finetune-LoRA, Koala | Dialog, Pairs | English | 52K entries, 21.4MB | A dataset generated by text-davinci-003 to improve language models' ability to follow human instructions. |
| OIG (OIG-small-chip2) | Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala | Dialog, Pairs | English, code | 44M entries | A large conversational instruction dataset with medium- and high-quality subsets (OIG-small-chip2) for multi-task learning. |
| ChatAlpaca data | / | Dialog, Pairs | English, Chinese version coming soon | 10K entries, 39.5MB | A dataset that aims to help researchers develop models for instruction following in multi-turn conversations. |
| InstructionWild | ColossalChat | Pairs | English, Chinese | 10K entries | An Alpaca-style dataset, but with seed tasks taken from ChatGPT screenshots. |
| Firefly(流萤) | Firefly(流萤) | Pairs | Chinese | 1.1M entries, 1.17GB | A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks, but no conversation. |
| BELLE (0.5M version, 1M version, 2M version) | BELLE series, Chunhua (春华) | Pairs | Chinese | 2.67B in total | A Chinese instruction dataset similar to Alpaca data, constructed by generating answers from seed tasks, but no conversation. |
| GuanacoDataset | Guanaco | Dialog, Pairs | English, Chinese, Japanese | 534,530 entries | A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition. |
| xP3 (and some variants) | BLOOMZ, mT0 | Pairs | Multilingual, code | 79M entries, 88GB | An instruction dataset for improving language models' generalization ability, similar to Natural Instruct. |
| OpenAI WebGPT | WebGPT's reward model, Koala | RLHF | English | 19,578 pairs | Dataset used in the WebGPT paper. Used for training the reward model in RLHF. |
| OpenAI Summarization Comparison | Koala | RLHF | English | ~93K entries, 420MB | A dataset of human feedback that helps train a reward model. The reward model was then used to train a summarization model to align with human preferences. |
| Natural Instruction (GitHub & Download) | tk-instruct series | Pairs, evaluation | Multilingual | / | A benchmark of over 1,600 tasks, each with an instruction and definition, for evaluating and improving language models' multi-task generalization under natural language instructions. |
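
The SHP row mentions inferring preferences from scores and timestamps. A minimal sketch of that idea follows, under the assumption that a later comment with a higher score is treated as preferred (it had less time to collect votes), and that no preference is inferred otherwise. The `Comment` class, its field names, and `infer_preference` are hypothetical and are not taken from the SHP release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    text: str
    score: int        # net upvotes (hypothetical field name)
    created_utc: int  # posting time as a Unix timestamp (hypothetical field name)


def infer_preference(a: Comment, b: Comment) -> Optional[Comment]:
    """Return the preferred comment, or None when no preference is inferred.

    A later comment had less time to collect votes, so if it still has the
    higher score it is treated as preferred. When the earlier comment has the
    higher score, the gap may only reflect extra exposure, so we abstain.
    """
    earlier, later = (a, b) if a.created_utc <= b.created_utc else (b, a)
    if later.created_utc > earlier.created_utc and later.score > earlier.score:
        return later
    return None


early = Comment("first answer", score=10, created_utc=100)
late = Comment("second answer", score=25, created_utc=200)
preferred = infer_preference(early, late)
```

Pairs for which the function returns `None` would simply be left out of a reward-model training set in this sketch.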

Potential Overlaps ⚠️

Row items are the subject of each relation: a cell describes how the row dataset relates to the column dataset.

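|  | OIG | hh-rlhf | xP3 | natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OIG | / | contains | overlap | overlap | overlap |  | overlap |
| hh-rlhf | part of | / |  |  |  |  | overlap |
| xP3 | overlap |  | / | overlap |  |  | overlap |
| natural instruct | overlap |  | overlap | / |  |  | overlap |
| AlpacaDataCleaned | overlap |  |  |  | / | overlap | overlap |
| GPT-4-LLM |  |  |  |  | overlap | / | overlap |
| Alpaca-CoT | overlap | overlap | overlap | overlap | overlap | overlap | / |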

Open Datasets for Pretraining 🟢

| Dataset name 📖 🔗 | Used by 🤖 | Type 🏷️ | Language 🌐 | Size 📏 | Description 🗒️ |
| --- | --- | --- | --- | --- | --- |
| falcon-refinedweb | tiiuae/falcon series | PT | English | / | A refined subset of CommonCrawl. |
| Common Crawl | LLaMA (after some processing) | building datasets, PT | / | / | The most well-known raw dataset, rarely used directly. One possible preprocessing pipeline is CCNet. |
| nlp_Chinese_Corpus | / | PT, TF | Chinese | / | A Chinese pretraining corpus. Includes Wikipedia, Baidu Baike, Baidu QA, some forum QA, and a news corpus. |
| The Pile (V1) | GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b | PT | Multilingual, code | 825GB | A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets covering many domains and tasks. |
| C4 (Huggingface dataset, TensorFlow dataset) | Google T5 Series, LLaMA | PT | English | 305GB | A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently used. |
| ROOTS | BLOOM | PT | Multilingual, code | 1.6TB | A diverse open-source dataset consisting of sub-datasets like Wikipedia and StackExchange for language modeling. |
| Pushshift reddit (paper) | OPT-175b | PT | / | / | Raw Reddit data; one possible processing pipeline is described in this paper. |
| Gutenberg project | LLaMA | PT | Multilingual | / | A book dataset, mostly novels. Not preprocessed. |
| CLUECorpus | / | PT, finetune, evaluation | Chinese | 100GB | A Chinese pretraining corpus sourced from Common Crawl. |
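
Several entries above are raw and need preprocessing before use. As a rough illustration of one such step, the sketch below removes short lines and lines already seen in earlier documents, which tends to strip repeated boilerplate such as navigation menus. It is not CCNet itself; as far as we understand, CCNet also applies language identification and perplexity-based filtering. The function names and the `min_chars` threshold are assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", line.strip().lower())


def dedup_lines(documents, min_chars=20):
    """Yield documents with short lines and previously seen lines removed.

    A line seen in any earlier document is dropped, which removes
    boilerplate that repeats across pages.
    """
    seen = set()
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            norm = normalize(line)
            if len(norm) < min_chars:
                continue  # too short to be useful text
            digest = hashlib.sha1(norm.encode("utf-8")).digest()
            if digest in seen:
                continue  # exact repeat of an earlier line
            seen.add(digest)
            kept.append(line.strip())
        if kept:
            yield "\n".join(kept)


docs = [
    "Home | About | Contact us today\nLarge language models need a lot of text.",
    "Home | About | Contact us today\nA second page with its own unique sentence.",
]
cleaned = list(dedup_lines(docs))
```

Keeping only hashes in `seen` bounds memory per line, but a corpus at Common Crawl scale would still need sharding or an approximate structure rather than a single in-memory set.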

Domain-specific Datasets 🟢 💡

| Dataset name 📖 | Used by 🤖 | Type 🏷️ | Language 🌐 | Size 📏 | Description 🗒️ |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-Jailbreak-Prompts ⚠️RISKY | / | / | English | 163KB file size | Prompts for bypassing the safety regulation of ChatGPT. Can be used for probing the harmlessness of LLMs. |
| awesome-chinese-legal-resources | LaWGPT | / | Chinese | / | A collection of Chinese legal data for LLM training. |
| Long Form | / | Pairs | English | 23.7K entries | A dataset aimed at improving the long-text generation ability of LLMs. |
| symbolic-instruction-tuning | / | Pairs | English, code | 796 | A dataset focused on 'symbolic' tasks such as SQL coding and mathematical computation. |
| Safety Prompt | / | Evaluation only | Chinese | 100K entries | Chinese safety prompts for evaluating and improving the safety of LLMs. |
| Tapir-Cleaned | / | Pairs | English | 116K entries | A revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction tuning. |
| instructional_codesearchnet_python | / | Pairs | English & Python | 192MB | A template-generated instructional Python dataset built from an annotated version of the code-search-net dataset for the Open-Assistant project. |
| finance-alpaca | / | Pairs | English | 1.3K entries | An Alpaca-style dataset focused on financial topics. |

Private Datasets 🔴

| Dataset name 📖 | Used by 🤖 | Type 🏷️ | Language 🌐 | Size 📏 | Description 🗒️ |
| --- | --- | --- | --- | --- | --- |
| WebText (Reddit links) | GPT-2 | PT | English | / | Data crawled from Reddit and filtered for GPT-2 pretraining. |
| MassiveText | Gopher, Chinchilla | PT | 99% English, 1% other (including code) |  |  |
| WuDao(悟道) Corpora | GLM | PT | Chinese | 200GB | A large-scale Chinese corpus. A component was possibly open-sourced originally but is not available now. |
