Merge pull request #1 from geronimi73/add

add a few, fix hrefs
mlabonne · Apr 29, 2024 · b70f450 · b70f450
2 parents ed98d7d + f29e502
commit b70f450
Showing 1 changed file with 5 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ Measuring accuracy can be easy in the case of mathematical problems using a Pyth
 
 Once a model has been pre-trained on a next-token prediction task, supervised fine-tuning is used to turn it into an assistant capable of answering questions and achieving tasks. These datasets contain pairs of instructions and outputs to train LLMs to go beyond their pre-training objective. All the datasets listed here should be under permissive licensing (Apache 2.0, MIT, cc-by-4.0, etc.).
 
-### General-purpose
+### General-purpose (sorted by size from largest to smallest )
 
 The goal of general-purpose datasets is to transform base models into versatile and capable assistants by exposing them to a wide range of high-quality data. These datasets often include a diverse mix of real-world and synthetic data, commonly generated using models like GPT-4.
 
@@ -40,6 +40,7 @@ The goal of general-purpose datasets is to transform base models into versatile
 | [NeurIPS-LLM-data](https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data)                                  | 204k  | Jindal et al.                | Nov 2023 | Winner of [NeurIPS LLM Efficiency Challenge](https://llm-efficiency-challenge.github.io/), with an interesting data preparation strategy.                                                               |
 | [UltraChat 200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)                                | 200k  | Tunstall et al., Ding et al. | Oct 2023 | Heavily filtered version of the [UItraChat](https://github.com/thunlp/UltraChat) dataset, consisting of 1.4M dialogues generated by ChatGPT.                                                            |
 | [WizardLM_evol_instruct_V2](https://huggingface.co/datasets/mlabonne/WizardLM_evol_instruct_v2_196K-ShareGPT) | 143k  | Xu et al.                    | Jun 2023 | Latest version of Evol-Instruct applied to Alpaca and ShareGPT data. See [WizardLM paper](https://arxiv.org/abs/2304.12244).                                                                            |
+| [sft_datablend_v1](https://huggingface.co/datasets/nvidia/sft_datablend_v1)                                   | 128k  | NVIDIA                       | Jan 2024 | Blend of publicly available datasets: OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K and others (45 total).                                                                                    |
 | [Synthia-v1.3](https://huggingface.co/datasets/migtissera/Synthia-v1.3)                                       | 119k  | Migel Tissera                | Nov 2023 | High-quality synthetic data generated using GPT-4.                                                                                                                                                      |
 | [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture)                                   | 95k   | Wan et al.                   | Feb 2024 | Selection of samples from high-quality datasets. See [FuseChat paper](https://arxiv.org/abs/2402.16107).                                                                                                |
 | [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)                                                | 84.4k | Köpf et al.                  | Mar 2023 | Human-generated assistant-style conversation corpus in 35 different languages. See [OASST1 paper](https://arxiv.org/abs/2304.07327) and [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2). |
@@ -48,6 +49,7 @@ The goal of general-purpose datasets is to transform base models into versatile
 | [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)       | 53k   | anon823 1489123              | Mar 2023 | Filtered version of the ShareGPT dataset, consisting of real conversations between users and ChatGPT.                                                                                                   |
 | [lmsys-chat-1m-smortmodelsonly](https://huggingface.co/datasets/Nebulous/lmsys-chat-1m-smortmodelsonly)       | 45.8k | Nebulous, Zheng et al.       | Sep 2023 | Filtered version of [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) with responses from GPT-4, GPT-3.5-turbo, Claude-2, Claude-1, and Claude-instant-1.                            |
 | [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)                                   | 24.9k | Lee et al.                   | Sep 2023 | Collection of datasets that were deduplicated using Sentence Transformers (it contains an NC dataset). See [Platypus paper](https://arxiv.org/abs/2308.07317).                                          |
+| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)                       | 15k   | Conover et al.               | May 2023 | Generated by Databricks employees, prompt/response pairs in eight different instruction categories, including the seven outlined in the InstructGPT paper.                                              |
 
 ### Math & Logic
 
@@ -58,6 +60,7 @@ LLMs often struggle with mathematical reasoning and formal logic, which has led
 | [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)                  | 395k | Yu et al.    | Dec 2023 | Bootstrap mathematical questions by rewriting them from multiple perspectives. See [MetaMath paper](https://arxiv.org/abs/2309.12284). |
 | [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct)              | 262k | Yue et al.   | Sep 2023 | Compiled from 13 math rationale datasets, six of which are newly curated, and focuses on chain-of-thought and program-of-thought.      |
 | [Orca-Math](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) | 200k | Mitra et al. | Feb 2024 | Grade school math world problems generated using GPT4-Turbo. See [Orca-Math paper](https://arxiv.org/pdf/2402.14830.pdf).              |
+| [OpenMathInstruct-1](https://huggingface.co/datasets/nvidia/OpenMathInstruct-1)     | 5.75M | Toshniwal et al.<br>(NVIDIA) | Feb 2024 | Problems from GSM8K and MATH, solutions generated by Mixtral-8x7B                                                                      |
 
 ### Code
 
@@ -95,7 +98,7 @@ Function calling allows large language models (LLMs) to execute predefined funct
 | Dataset                                                                                           | #     | Authors         | Date     | Notes                                                                               |
 | ------------------------------------------------------------------------------------------------- | ----- | --------------- | -------- | ----------------------------------------------------------------------------------- |
 | [Agent-FLAN](https://huggingface.co/datasets/internlm/Agent-FLAN)                                 | 34.4k | internlm        | Mar 2024 | Mix of AgentInstruct, ToolBench, and ShareGPT datasets.                             |
-| [glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | 113k  | Sahil Chaudhary | Sep 2023 | High-quality dataset with pairs of instructions and answers in different languages. |
+| [glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | 113k  | Sahil Chaudhary | Sep 2023 | High-quality dataset with pairs of instructions and answers in different languages. <br>See [Locutusque/function-calling-chatml](https://huggingface.co/datasets/Locutusque/function-calling-chatml) for a variant without conversation tags. |
 
 ## ⚖️ Preference alignment