Add PubMed Central
leogao2 committed Sep 18, 2020
1 parent b13231d commit 560680f
Showing 3 changed files with 50 additions and 17 deletions.
35 changes: 18 additions & 17 deletions README.md
@@ -5,23 +5,24 @@ The Pile is (going to be) the world's largest diverse open source language model

 | Component | Size |Weight|Epochs|Mean Document Size|
 |-----------------|----------|------|-----:|------------------|
-|Bibliotik |100.96 GiB|26.84%| 2.971|538.36 KiB |
-|ArXiv |56.21 GiB |14.94%| 2.971|46.61 KiB |
-|FreeLaw |51.15 GiB |13.60%| 2.971|15.06 KiB |
-|OpenWebText |37.03 GiB |9.84% | 2.971|4.84 KiB |
-|StackExchange |32.20 GiB |8.56% | 2.971|2.16 KiB |
-|PubMed Abstracts |19.26 GiB |5.12% | 2.971|1.30 KiB |
-|Wikipedia (en) |17.27 GiB |4.59% | 2.971|3.00 KiB |
-|OpenSubtitles |12.98 GiB |3.45% | 2.971|30.48 KiB |
-|Literotica |11.60 GiB |3.08% | 2.971|25.69 KiB |
-|Gutenberg (PG-19)|10.88 GiB |2.89% | 2.971|398.73 KiB |
-|DM Mathematics |7.75 GiB |2.06% | 2.971|47.21 MiB |
-|BookCorpus |6.30 GiB |1.68% | 2.971|369.87 KiB |
-|Ubuntu IRC |5.52 GiB |1.47% | 2.971|15.96 MiB |
-|CORD-19 |4.26 GiB |1.13% | 2.971|25.59 KiB |
-|NIH ExPorter |1.89 GiB |0.50% | 2.971|2.11 KiB |
-|Enron Emails |901.43 MiB|0.23% | 2.971|1.78 KiB |
-|**Total** |376.14 GiB| | |7.47 KiB |
+|Bibliotik |100.96 GiB|21.65%| 2.396|538.36 KiB |
+|PubMed Central |90.27 GiB |19.35%| 2.396|30.55 KiB |
+|ArXiv |56.21 GiB |12.05%| 2.396|46.61 KiB |
+|FreeLaw |51.15 GiB |10.97%| 2.396|15.06 KiB |
+|OpenWebText |37.03 GiB |7.94% | 2.396|4.84 KiB |
+|StackExchange |32.20 GiB |6.90% | 2.396|2.16 KiB |
+|PubMed Abstracts |19.26 GiB |4.13% | 2.396|1.30 KiB |
+|Wikipedia (en) |17.27 GiB |3.70% | 2.396|3.00 KiB |
+|OpenSubtitles |12.98 GiB |2.78% | 2.396|30.48 KiB |
+|Literotica |11.60 GiB |2.49% | 2.396|25.69 KiB |
+|Gutenberg (PG-19)|10.88 GiB |2.33% | 2.396|398.73 KiB |
+|DM Mathematics |7.75 GiB |1.66% | 2.396|47.21 MiB |
+|BookCorpus |6.30 GiB |1.35% | 2.396|369.87 KiB |
+|Ubuntu IRC |5.52 GiB |1.18% | 2.396|15.96 MiB |
+|CORD-19 |4.26 GiB |0.91% | 2.396|25.59 KiB |
+|NIH ExPorter |1.89 GiB |0.41% | 2.396|2.11 KiB |
+|Enron Emails |901.43 MiB|0.19% | 2.396|1.78 KiB |
+|**Total** |466.41 GiB| | |8.75 KiB |
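
The updated Weight column appears to be each component's size divided by the total, since every component in `pile.py` carries the same multiplier of `1.`. The following is a minimal sketch of that arithmetic, not code from the repository: the name `SIZES_GIB` and the helper `weights` are hypothetical, and the sizes are copied by hand from the new table rows. The final lines rest on a further assumption, that the Epochs column reflects a fixed training budget spread over a larger corpus, which this commit does not state.

```python
# Component sizes in GiB, copied from the updated README table.
# SIZES_GIB and weights() are illustrative names, not part of the repository.
SIZES_GIB = {
    "Bibliotik": 100.96,
    "PubMed Central": 90.27,
    "ArXiv": 56.21,
    "FreeLaw": 51.15,
    "OpenWebText": 37.03,
    "StackExchange": 32.20,
    "PubMed Abstracts": 19.26,
    "Wikipedia (en)": 17.27,
    "OpenSubtitles": 12.98,
    "Literotica": 11.60,
    "Gutenberg (PG-19)": 10.88,
    "DM Mathematics": 7.75,
    "BookCorpus": 6.30,
    "Ubuntu IRC": 5.52,
    "CORD-19": 4.26,
    "NIH ExPorter": 1.89,
    "Enron Emails": 901.43 / 1024,  # listed in the table as 901.43 MiB
}


def weights(sizes):
    """Return each component's share of the total size (all multipliers equal)."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


total_gib = sum(SIZES_GIB.values())
w = weights(SIZES_GIB)

print(f"total: {total_gib:.2f} GiB")
for name in ("Bibliotik", "PubMed Central", "Enron Emails"):
    print(f"{name}: {w[name]:.2%}")

# Assumption (not stated in the commit): the training budget is fixed, so the
# epoch count shrinks in proportion to the growth of the corpus.
old_total_gib, old_epochs = 376.14, 2.971
new_epochs = old_epochs * old_total_gib / total_gib
print(f"epochs under a fixed budget: {new_epochs:.3f}")
```

Under these assumptions the computed total, weights, and epoch count can be compared directly against the added table rows.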



31 changes: 31 additions & 0 deletions datasets.py
@@ -593,3 +593,34 @@ def size(self):

     def num_docs(self):
         return 3562015
+
+
+class PubMedCentralDataset(Dataset):
+    def name(self):
+        return "PubMed Central"
+
+    def _download(self):
+        if not os.path.exists('components/pubmedcentral'):
+            sh("""
+            mkdir -p components/pubmedcentral
+            cd components/pubmedcentral
+            wget https://eaidata.bmk.sh/data/PMC_extracts.tar.gz
+            """)
+            sha256sum('components/pubmedcentral/PMC_extracts.tar.gz')
+
+    def documents(self):
+        self._download()
+
+        yield from lmd.Reader('components/pubmedcentral/PMC_extracts.tar.gz').stream_data()
+
+    def clean(self):
+        if os.path.exists('components/pubmedcentral'):
+            sh("""
+            rm -rf components/pubmedcentral
+            """)
+
+    def size(self):
+        return 96929951580
+
+    def num_docs(self):
+        return 3098931
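
The hard-coded values returned by `size()` and `num_docs()` can be cross-checked against the PubMed Central row in the README table. This is a small sketch assuming `size()` is in bytes and that the table uses binary units (1 GiB = 2^30 bytes, 1 KiB = 2^10 bytes); the constant names are illustrative only.

```python
# Values returned by PubMedCentralDataset.size() and .num_docs() in this commit.
# Assumption: size() is a byte count.
SIZE_BYTES = 96929951580
NUM_DOCS = 3098931

size_gib = SIZE_BYTES / 2**30                 # bytes -> GiB
mean_doc_kib = SIZE_BYTES / NUM_DOCS / 2**10  # mean bytes per document -> KiB

print(f"size: {size_gib:.2f} GiB")
print(f"mean document size: {mean_doc_kib:.2f} KiB")
```

Both figures can then be compared with the Size and Mean Document Size columns of the new README row.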
1 change: 1 addition & 0 deletions pile.py
@@ -10,6 +10,7 @@

 datasets = [
     (BibliotikDataset() , 1. ),
+    (PubMedCentralDataset(), 1. ),
     (ArXivDataset() , 1. ),
     (FreeLawDataset() , 1. ),
     (OpenWebTextDataset() , 1. ),