**Commit 87be4c6** — "general update", authored by mlabonne, Jun 22, 2024 (parent 14e9f94). 1 changed file (README.md): 10 additions, 9 deletions.

- Remove Magpie due to inconsistent results
- Reformat data prep kit
- Additional information
- Rework "complexity" characteristic
Data is the most valuable asset in LLM development. While datasets can't be directly evaluated like models, high-quality datasets share the following characteristics:

* **Accuracy**: Samples should be factually correct, helpful to users, and well-written. Answers should also be relevant to their corresponding instructions.
* **Diversity**: You want to cover as many use cases as possible to ensure proper instruction-following and relevant answers. This requires a wide range of topics, contexts, lengths, writing styles, etc. sampled in a representative way.
* **Complexity**: Answers should be nontrivial and either (a) representative of the tasks you expect the model to handle, or (b) include complex tasks involving multi-step reasoning, planning, etc.

Measuring accuracy can be easy in the case of mathematical problems, where answers can be checked with a Python interpreter, or near-impossible with open-ended, subjective questions. On the other hand, clustering datasets by topic is a good way of measuring diversity. Finally, complexity can be assessed using other LLMs acting as judges.
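To make the first point concrete, here is a minimal sketch of checking a mathematical sample with a Python interpreter. The sample format and the `reference` field are assumptions for illustration, not a standard schema:

```python
# Sketch: verify a math sample's answer by executing a reference program.
# The reference program is assumed to define a variable named `result`.
def check_math_answer(model_answer: str, reference_program: str) -> bool:
    """Run the reference program and compare its result to the model answer."""
    scope: dict = {}
    exec(reference_program, scope)  # trusted reference code only
    expected = scope["result"]
    try:
        return float(model_answer) == float(expected)
    except ValueError:
        return model_answer.strip() == str(expected)

sample = {
    "instruction": "What is 17 * 24?",
    "answer": "408",
    "reference": "result = 17 * 24",
}
print(check_math_answer(sample["answer"], sample["reference"]))  # True
```

Open-ended questions have no such executable ground truth, which is why they usually require an LLM judge instead.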

## 📅 Open SFT datasets

The goal of general-purpose datasets is to transform base models into versatile assistants.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 518k | Lian et al. | Sep 2023 | Curated subset of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) using GPT-4-as-a-judge to remove wrong answers. |
| [Tulu V2 Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See [Tulu 2 paper](https://arxiv.org/abs/2311.10702). |
| [UltraInteract SFT](https://huggingface.co/datasets/openbmb/UltraInteract_sft) | 289k | Yuan et al. | Apr 2024 | Focus on math, coding, and logic tasks with step-by-step answers. See [Eurus paper](https://arxiv.org/abs/2404.02078). |
| [NeurIPS-LLM-data](https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data) | 204k | Jindal et al. | Nov 2023 | Winner of [NeurIPS LLM Efficiency Challenge](https://llm-efficiency-challenge.github.io/), with an interesting data preparation strategy. |
| [UltraChat 200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 200k | Tunstall et al., Ding et al. | Oct 2023 | Heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset, consisting of 1.4M dialogues generated by ChatGPT. |
Many datasets focus on pairs of instructions and outputs, but chat models are often used in multi-turn conversations.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| [Bluemoon](https://huggingface.co/datasets/Squish42/bluemoon-fandom-1-1-rp-cleaned) | 290k | Squish42 | Jun 2023 | Posts from the Blue Moon roleplaying forum cleaned and scraped by a third party. |
| [PIPPA](https://huggingface.co/datasets/kingbri/PIPPA-shareGPT) | 16.8k | Gosling et al., kingbri | Aug 2023 | Deduped version of Pygmalion's [PIPPA](https://huggingface.co/datasets/PygmalionAI/PIPPA) in ShareGPT format. |
| [Capybara](https://huggingface.co/datasets/LDJnr/Capybara) | 16k | LDJnr | Dec 2023 | Strong focus on information diversity across a wide range of domains with multi-turn conversations. |
| [RPGPT_PublicDomain-alpaca](https://huggingface.co/datasets/practical-dreamer/RPGPT_PublicDomain-alpaca) | 4.26k | practical dreamer | May 2023 | Synthetic dataset of public domain character dialogue in roleplay format, made with [build-a-dataset](https://github.com/practical-dreamer/build-a-dataset). |
| [Pure-Dove](https://huggingface.co/datasets/LDJnr/Pure-Dove) | 3.86k | LDJnr | Sep 2023 | Highly filtered multi-turn conversations between GPT-4 and real humans. |
| [Opus Samantha](https://huggingface.co/datasets/macadeliccc/opus_samantha) | 1.85k | macadeliccc | Apr 2024 | Multi-turn conversations with Claude 3 Opus. |
| [LimaRP-augmented](https://huggingface.co/datasets/grimulkan/LimaRP-augmented) | 804 | lemonilia, grimulkan | Jan 2024 | Augmented and cleansed version of LimaRP, consisting of human roleplaying conversations. |
Function calling allows large language models (LLMs) to execute predefined functions with arguments inferred from the user's prompt, instead of generating standard text responses.
| Dataset | # | Authors | Date | Notes |
| ------------------------------------------------------------------------------------------------- | ----- | --------------- | -------- | ----------------------------------------------------------------------------------- |
| [glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | 113k | Sahil Chaudhary | Sep 2023 | High-quality dataset with pairs of instructions and answers in different languages. <br>See [Locutusque/function-calling-chatml](https://huggingface.co/datasets/Locutusque/function-calling-chatml) for a variant without conversation tags. |
| [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 60k | Salesforce | Jun 2024 | Samples created using a data generation pipeline designed to produce verifiable data for function-calling applications. |
| [Agent-FLAN](https://huggingface.co/datasets/internlm/Agent-FLAN) | 34.4k | internlm | Mar 2024 | Mix of AgentInstruct, ToolBench, and ShareGPT datasets. |
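To illustrate what such samples look like, here is a hypothetical function-calling training example, loosely in the style of the datasets above (exact schemas vary by dataset, and the `get_weather` tool is invented for illustration):

```python
import json

# A single hypothetical training sample: the available tools, the user turn,
# and the assistant turn containing a structured call instead of free text.
sample = {
    "tools": [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant",
         "content": json.dumps({"name": "get_weather",
                                "arguments": {"city": "Paris"}})},
    ],
}

# The target output is machine-parseable, which is what makes it verifiable.
call = json.loads(sample["messages"][1]["content"])
print(call["name"])  # get_weather
```

The key property is that the assistant's turn can be parsed and checked against the tool schema, rather than judged as prose.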

## ⚖️ Preference alignment
W.I.P.
To create a high-quality dataset, focus on carefully curating a diverse set of relevant, accurate and informative examples rather than simply maximizing dataset size.

Start by aggregating available data from various sources (open-source or not) and apply filters like deduplication and quality checks. If the initial dataset is small or insufficient, consider synthetically generating additional data that mirrors its quality and style. Iteratively explore and refine the dataset by assessing model performance, identifying gaps, and collecting or generating data to address those shortcomings.
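A minimal sketch of this aggregate-then-filter loop, combining exact deduplication by hashing with a simple rule-based quality filter. The sample schema and the length threshold are illustrative assumptions, not recommendations:

```python
import hashlib

def clean(samples: list[dict]) -> list[dict]:
    """Drop exact duplicates and trivially low-quality samples."""
    seen, kept = set(), []
    for s in samples:
        key = hashlib.sha256((s["instruction"] + s["answer"]).encode()).hexdigest()
        if key in seen:
            continue                      # exact duplicate
        seen.add(key)
        if len(s["answer"].split()) < 3:  # too short to be informative
            continue
        kept.append(s)
    return kept

raw = [
    {"instruction": "Define LLM.", "answer": "A large language model trained on text."},
    {"instruction": "Define LLM.", "answer": "A large language model trained on text."},
    {"instruction": "Say hi.", "answer": "Hi."},
]
print(len(clean(raw)))  # 1
```

Real pipelines layer many such filters (language, toxicity, perplexity) and re-run them after each round of data collection or generation.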
### Data deduplication

* **MinHash**: fuzzy deduplication with hashing, sorting, and Jaccard similarity (preferred technique).
* **BLOOM filters**: fuzzy deduplication with hashing and a fixed-size vector.
* **Sentence deduplication**: exact sentence matching.
* **Decontamination**: remove samples that are too close to test sets, using either exact or fuzzy filtering.
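The MinHash idea can be sketched with the standard library alone. This is a toy illustration of the estimate, not a production implementation (real pipelines use banded LSH over the signatures to avoid pairwise comparisons):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash of the shingle set under num_hashes seeded hashes."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

near_dup = estimated_jaccard(
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog now",
)
unrelated = estimated_jaccard(
    "the quick brown fox jumps over the lazy dog today",
    "completely different sentence about large language models and data",
)
print(near_dup > unrelated)  # True
```

Pairs whose estimated similarity exceeds a threshold (commonly around 0.8) are treated as fuzzy duplicates and deduplicated.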

### Data quality

### Data generation

* [**Distilabel**](https://github.com/argilla-io/distilabel): General-purpose framework that can generate and augment data (SFT, DPO) with techniques like UltraFeedback and DEITA.
* [**llm-swarm**](https://github.com/huggingface/llm-swarm): Generate synthetic datasets for pretraining or fine-tuning using either local LLMs or Inference Endpoints on the Hugging Face Hub.
* [**Auto Data**](https://github.com/Itachi-Uchiha581/Auto-Data): Lightweight library to automatically generate fine-tuning datasets with API models.
* [**Bonito**](https://github.com/BatsResearch/bonito): Library for generating synthetic instruction tuning datasets for your data without GPT (see also [AutoBonito](https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing)).
* [**Augmentoolkit**](https://github.com/e-p-armstrong/augmentoolkit): Framework to convert raw text into datasets using open-source and closed-source models.

### Data preparation
* [**Data Prep Kit**](https://github.com/IBM/data-prep-kit): Community project to democratize and accelerate unstructured data preparation for LLM app developers. It offers [data preparation modules](https://github.com/IBM/data-prep-kit/tree/dev/transforms) for both code and language modalities, with high-level APIs that let developers work with their data without expertise in the underlying runtimes. Modules run on Python, Ray, and Spark, scaling from a laptop to an entire data center, and KFP-based pipelines enable no-code data processing. See the [getting started](https://github.com/IBM/data-prep-kit/tree/dev?tab=readme-ov-file#-getting-started-) guide for examples.

## Acknowledgments

Special thanks to [geronimi73](https://github.com/geronimi73) and [Bytes-Explorer](https://github.com/Bytes-Explorer) for their PRs.

## References

