add Buzz and WebInstructSub

mlabonne · May 15, 2024 · 255875d
1 parent 6c1205e
Showing 1 changed file with 12 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -28,16 +28,18 @@ Once a model has been pre-trained on a next-token prediction task, supervised fi
 
 The goal of general-purpose datasets is to transform base models into versatile and capable assistants by exposing them to a wide range of high-quality data. These datasets often include a diverse mix of real-world and synthetic data, commonly generated using models like GPT-4.
 
-| Dataset | # | Authors | Date | Notes |
-| ------------------------------------------------------------------------------------------------------------- | ----- | ---------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [Bagel](https://github.com/jondurbin/bagel) | >2M? | Jon Durbin | Jan 2024 | Collection of datasets decontaminated with cosine similarity. |
-| [Hercules v4.5](https://huggingface.co/datasets/Locutusque/hercules-v4.5) | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See [v4](https://huggingface.co/datasets/Locutusque/hercules-v4.0) for the list of datasets. |
-| [Dolphin-2.9](https://huggingface.co/datasets/cognitivecomputations/Dolphin-2.9) | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
-| [WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) | 1.04M | Zhao et al. | May 2023 | Real conversations between human users and GPT-3.5/4, including metadata. See the [WildChat paper](https://arxiv.org/abs/2405.01470). |
-| [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
-| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 518k | Lian et al. | Sep 2023 | Curated subset of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) using GPT-4-as-a-judge to remove wrong answers. |
-| [Tulu V2 Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See [Tulu 2 paper](https://arxiv.org/abs/2311.10702). |
-| [UltraInteract SFT](https://huggingface.co/datasets/openbmb/UltraInteract_sft) | 289k | Yuan et al. | Apr 2024 | Focus on math, coding, and logic tasks with step-by-step answers. See [Eurus paper](https://arxiv.org/abs/2404.02078). |
+| Dataset | # | Authors | Date | Notes |
+| ------------------------------------------------------------------------------------------------------------- | ----- | ---------------------------- | -------- | --------------------------------------------------------------------------------- |
+| [Buzz](https://huggingface.co/datasets/H-D-T/Buzz) | 31.2M | Alignment Lab AI | May 2024 | Huge collection of 435 datasets with data augmentation, deduplication, and other techniques. |
+| [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub) | 2.39M | Yue et al. | May 2024 | Instructions created by retrieving documents from Common Crawl, extracting QA pairs, and refining them. This is a subset of the dataset described in the [MAmmoTH2 paper](https://arxiv.org/abs/2405.03548). |
+| [Bagel](https://github.com/jondurbin/bagel) | >2M? | Jon Durbin | Jan 2024 | Collection of datasets decontaminated with cosine similarity. |
+| [Hercules v4.5](https://huggingface.co/datasets/Locutusque/hercules-v4.5) | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See [v4](https://huggingface.co/datasets/Locutusque/hercules-v4.0) for the list of datasets. |
+| [Dolphin-2.9](https://huggingface.co/datasets/cognitivecomputations/Dolphin-2.9) | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
+| [WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) | 1.04M | Zhao et al. | May 2023 | Real conversations between human users and GPT-3.5/4, including metadata. See the [WildChat paper](https://arxiv.org/abs/2405.01470). |
+| [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
+| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 518k | Lian et al. | Sep 2023 | Curated subset of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) using GPT-4-as-a-judge to remove wrong answers. |
+| [Tulu V2 Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See [Tulu 2 paper](https://arxiv.org/abs/2311.10702). |
+| [UltraInteract SFT](https://huggingface.co/datasets/openbmb/UltraInteract_sft) | 289k | Yuan et al. | Apr 2024 | Focus on math, coding, and logic tasks with step-by-step answers. See [Eurus paper](https://arxiv.org/abs/2404.02078). |
 | [NeurIPS-LLM-data](https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data) | 204k | Jindal et al. | Nov 2023 | Winner of [NeurIPS LLM Efficiency Challenge](https://llm-efficiency-challenge.github.io/), with an interesting data preparation strategy. |
 | [UltraChat 200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 200k | Tunstall et al., Ding et al. | Oct 2023 | Heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset, which consists of 1.4M dialogues generated by ChatGPT. |
 | [WizardLM_evol_instruct_V2](https://huggingface.co/datasets/mlabonne/WizardLM_evol_instruct_v2_196K-ShareGPT) | 143k | Xu et al. | Jun 2023 | Latest version of Evol-Instruct applied to Alpaca and ShareGPT data. See [WizardLM paper](https://arxiv.org/abs/2304.12244). |