Commit

Add new emojis
mlabonne committed May 15, 2024
1 parent 255875d commit 52ed7bb
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -30,8 +30,8 @@ The goal of general-purpose datasets is to transform base models into versatile

| Dataset | # | Authors | Date | Notes |
| ------------------------------------------------------------------------------------------------------------- | ----- | ---------------------------- | -------- | --------------------------------------------------------------------------------- |
-| [Buzz](https://huggingface.co/datasets/H-D-T/Buzz) | 31.2M | Alignment Lab AI | May 2024 | Huge collection of 435 datasets with data augmentation, deduplication, and other techniques. |
-| [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub) | 2.39M | Yue et al. | May 2024 | Instructions created by retrieving document from Common Crawl, extracting QA pairs, and refining them. See the [MAmmoTH2 paper](https://arxiv.org/abs/2405.03548) (this is a subset). |
+| 🆕 [Buzz](https://huggingface.co/datasets/H-D-T/Buzz) | 31.2M | Alignment Lab AI | May 2024 | Huge collection of 435 datasets with data augmentation, deduplication, and other techniques. |
+| 🆕 [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub) | 2.39M | Yue et al. | May 2024 | Instructions created by retrieving document from Common Crawl, extracting QA pairs, and refining them. See the [MAmmoTH2 paper](https://arxiv.org/abs/2405.03548) (this is a subset). |
| [Bagel](https://github.com/jondurbin/bagel) | >2M? | Jon Durbin | Jan 2024 | Collection of datasets decontaminated with cosine similarity. |
| [Hercules v4.5](https://huggingface.co/datasets/Locutusque/hercules-v4.5) | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See [v4](https://huggingface.co/datasets/Locutusque/hercules-v4.0) for the list of datasets. |
| [Dolphin-2.9](https://huggingface.co/datasets/cognitivecomputations/Dolphin-2.9) | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
@@ -47,7 +47,7 @@ The goal of general-purpose datasets is to transform base models into versatile
| [Synthia-v1.3](https://huggingface.co/datasets/migtissera/Synthia-v1.3) | 119k | Migel Tissera | Nov 2023 | High-quality synthetic data generated using GPT-4. |
| [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) | 95k | Wan et al. | Feb 2024 | Selection of samples from high-quality datasets. See [FuseChat paper](https://arxiv.org/abs/2402.16107). |
| [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) | 84.4k | Köpf et al. | Mar 2023 | Human-generated assistant-style conversation corpus in 35 different languages. See [OASST1 paper](https://arxiv.org/abs/2304.07327) and [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2). |
-| [WizardLM_evol_instruct_70k](https://huggingface.co/datasets/mlabonne/WizardLM_evol_instruct_v2_70K-ShareGPT) | 70k | Xu et al. | Apr 2023 | Evol-Instruct applied to Alpaca and ShareGPT data. See [WizardLM paper](https://arxiv.org/abs/2304.12244). |
+| [WizardLM_evol_instruct_70k](https://huggingface.co/datasets/mlabonne/WizardLM_evol_instruct_70k-ShareGPT) | 70k | Xu et al. | Apr 2023 | Evol-Instruct applied to Alpaca and ShareGPT data. See [WizardLM paper](https://arxiv.org/abs/2304.12244). |
| [airoboros-3.2](https://huggingface.co/datasets/jondurbin/airoboros-3.2) | 58.7k | Jon Durbin | Dec 2023 | High-quality uncensored dataset. |
| [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) | 53k | anon823 1489123 | Mar 2023 | Filtered version of the ShareGPT dataset, consisting of real conversations between users and ChatGPT. |
| [lmsys-chat-1m-smortmodelsonly](https://huggingface.co/datasets/Nebulous/lmsys-chat-1m-smortmodelsonly) | 45.8k | Nebulous, Zheng et al. | Sep 2023 | Filtered version of [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) with responses from GPT-4, GPT-3.5-turbo, Claude-2, Claude-1, and Claude-instant-1. |