The Pile is (going to be) the world's largest open-source language modeling dataset. We are currently developing Version 1, with a goal of 1 TiB of English text.
Component | Size | Weight | Epochs (@1.2TB) |
---|---|---|---|
Bibliotik | 100.96 GiB | 47.48% | 5.256 |
OpenWebText | 37.03 GiB | 17.41% | 5.256 |
Wikipedia (en) | 17.27 GiB | 8.12% | 5.256 |
OpenSubtitles | 12.98 GiB | 6.11% | 5.256 |
Gutenberg (PG-19) | 10.88 GiB | 5.12% | 5.256 |
Literotica | 8.81 GiB | 4.14% | 5.256 |
DM Mathematics | 7.75 GiB | 3.64% | 5.256 |
BookCorpus | 6.30 GiB | 2.96% | 5.256 |
Ubuntu IRC | 5.52 GiB | 2.59% | 5.256 |
CORD-19 | 4.26 GiB | 2.00% | 5.256 |
Enron Emails | 901.43 MiB | 0.41% | 5.256 |
Total | 212.63 GiB | 100.00% | 5.256 |
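Because each component's weight is proportional to its size, every component makes the same number of passes at a fixed training budget: 1.2 TB ÷ 212.63 GiB ≈ 5.256 epochs. As an illustrative check (not part of the codebase), the Weight and Epochs columns can be reproduced from the sizes alone:

```python
# Illustrative check of the Weight and Epochs columns above; sizes are
# copied from the table (Enron's 901.43 MiB converted to GiB).
GIB = 2 ** 30
TARGET_GIB = 1.2e12 / GIB  # 1.2 TB expressed in GiB (~1117.59 GiB)

sizes_gib = {
    "Bibliotik": 100.96,
    "OpenWebText": 37.03,
    "Wikipedia (en)": 17.27,
    "OpenSubtitles": 12.98,
    "Gutenberg (PG-19)": 10.88,
    "Literotica": 8.81,
    "DM Mathematics": 7.75,
    "BookCorpus": 6.30,
    "Ubuntu IRC": 5.52,
    "CORD-19": 4.26,
    "Enron Emails": 901.43 / 1024,
}

total = sum(sizes_gib.values())  # ~212.63 GiB
for name, size in sizes_gib.items():
    weight = size / total                # component's share of the Pile
    epochs = TARGET_GIB * weight / size  # passes over this component at 1.2 TB
    print(f"{name}: weight={weight:.2%}, epochs={epochs:.3f}")
```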
The following components need to be downloaded manually. Either download them or comment them out in `pile.py` (see the sketch after the list below).
- Bibliotik: `books3.tar.gz` needs to be in the current directory. Download temporarily unavailable.
- CORD-19: `document_parses` needs to be in the current directory. Download from here.
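As a convenience, a preflight check like the following (a sketch, not part of the repo) can confirm the manual downloads are in place before running `pile.py`:

```python
# Sketch of a preflight check (not part of the repo): verify the
# manually downloaded inputs exist before running pile.py.
import os
import sys

REQUIRED = {
    "Bibliotik": "books3.tar.gz",
    "CORD-19": "document_parses",
}

missing = [f"{name} ({path})" for name, path in REQUIRED.items()
           if not os.path.exists(path)]
if missing:
    sys.exit("Missing manual downloads: " + ", ".join(missing)
             + ". Either fetch them or comment those components out of pile.py.")
```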
To propose a new dataset for the Pile, open an issue with the dataset proposal tag. Your issue should include a description of the dataset, its size, what language(s) it is in, a link to the data, and any other relevant information. If a project manager approves your proposal, they will mark it as accepted and add it to the list of datasets to be included. Datasets that we elect not to include in the current version of the Pile will receive a deferred or declined label. While we welcome multilingual datasets and plan on including non-English datasets in the future, the initial release of the Pile will be English-only, and all submissions of non-English datasets will be deferred.
To claim an unclaimed dataset, leave a comment on one of our unassigned issues. Once a dataset has been assigned to you, make the necessary changes to `datasets.py` and `pile.py` in a fork and submit a pull request. If needed, you can also submit a script for processing the data, as shown here.
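The exact interface in `datasets.py` may differ, but as a rough sketch, a new dataset entry typically needs a name, a size, and a way to iterate over its documents. The class name, method names, and file path below are assumptions for illustration, not the repo's actual API:

```python
# Rough sketch of a new dataset for datasets.py. The class name, method
# names, and file path are assumptions for illustration, not the real API.
import json


class MyNewDataset:
    def name(self):
        return "MyNewDataset"

    def size(self):
        # Total size of the processed text, in bytes.
        return 1_000_000_000

    def documents(self):
        # Yield one plain-text document at a time.
        # "my_new_dataset.jsonl" is a hypothetical local file.
        with open("my_new_dataset.jsonl", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["text"]
```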
To raise an issue that is not a dataset proposal, open an issue with whichever tag is appropriate.