Commit

add WildChat
mlabonne committed May 4, 2024
1 parent 4463bf6 commit e4a8ecd
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -31,8 +31,9 @@ The goal of general-purpose datasets is to transform base models into versatile
| Dataset | # | Authors | Date | Notes |
| ------------------------------------------------------------------------------------------------------------- | ----- | ---------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Bagel](https://github.com/jondurbin/bagel) | >2M? | Jon Durbin | Jan 2024 | Collection of datasets decontaminated with cosine similarity. |
-| [Hercules v4.5](https://huggingface.co/datasets/Locutusque/hercules-v4.5) | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See [v4](https://huggingface.co/datasets/Locutusque/hercules-v4.0) for the list of datasets. |
-| [Dolphin-2.9](https://huggingface.co/datasets/cognitivecomputations/Dolphin-2.9) | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
+| [Hercules v4.5](https://huggingface.co/datasets/Locutusque/hercules-v4.5) | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See [v4](https://huggingface.co/datasets/Locutusque/hercules-v4.0) for the list of datasets. |
+| [Dolphin-2.9](https://huggingface.co/datasets/cognitivecomputations/Dolphin-2.9) | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
+| [WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) | 1.04M | Zhao et al. | May 2023 | Real conversations between human users and GPT-3.5/4, with demographic data such as state, country, hashed IP addresses, and request headers. |
| [OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
| [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) | 518k | Lian et al. | Sep 2023 | Curated subset of [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) using GPT-4-as-a-judge to remove wrong answers. |
| [Tulu V2 Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See [Tulu 2 paper](https://arxiv.org/abs/2311.10702). |