added readme
soldni committed Jun 30, 2023
1 parent 274348e commit e5be951
Showing 4 changed files with 35 additions and 3 deletions.
25 changes: 25 additions & 0 deletions CITATION.cff
@@ -0,0 +1,25 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: peS2o
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Soldaini
    given-names: Luca
    email: [email protected]
    affiliation: Allen Institute for AI
    orcid: 'https://orcid.org/0000-0001-6998-9863'
  - given-names: Kyle
    family-names: Lo
    email: [email protected]
    affiliation: Allen Institute for AI
    orcid: 'https://orcid.org/0000-0002-1804-2853'
repository-code: 'https://github.com/allenai/peS2o'
url: 'https://huggingface.co/datasets/allenai/pes2o'
abstract: >
  The peS2o dataset is a collection of ~40M Creative Commons licensed academic papers, cleaned, filtered, and formatted for pre-training of language models. It is derived from S2ORC.
license: Apache-2.0
13 changes: 10 additions & 3 deletions README.md
@@ -1,13 +1,20 @@
<p align="center" style="margin-top: -2em">
<img src="https://huggingface.co/datasets/allenai/pes2o/resolve/main/logo.png" alt="peS2o logo. It's a picture of a mortar and pestle with documents flying in." width=384px height=auto>
<img src="res/logo.png" alt="peS2o logo. It's a picture of a mortar and pestle with documents flying in." width=384px height=auto>
</p>
<p align="center" style="font-size: 1.2em; margin-top: -1em"><i>Pretraining Efficiently on <a href="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/allenai/s2orc">S2ORC</a>!</i></p>
<p align="center" style="font-size: 1.2em;">Available on the <a href="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/allenai/s2orc">Huggingface Hub</a></p>


The peS2o dataset is a collection of ~40M Creative Commons licensed academic papers,
cleaned, filtered, and formatted for pre-training of language models. It is derived from
the [Semantic Scholar Open Research Corpus][2] ([Lo et al., 2020][1]), or S2ORC.

<p align="center" style="font-size: 1.2em;">peS2o is available on the <span><img src="res/hf-logo.png" width=auto height=30px style="margin: -8px auto;"></span> <a href="https://huggingface.co/datasets/allenai/pes2o">Huggingface Hub</a>!</p>


```python
from datasets import load_dataset
dataset = load_dataset("allenai/peS2o", "v2", split="train")
```

We release multiple versions of peS2o, each with different processing and a different
knowledge cutoff date. We recommend using the latest version available.
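
Because the dataset holds roughly 40M papers, it can help to look at a few records before downloading a whole version. The following is a minimal sketch, not part of the original README: `pes2o_load_kwargs` and `preview` are hypothetical helper names, and it assumes the `"v2"` configuration shown above can be read with the `streaming=True` option of `datasets.load_dataset` in your installed version of the library.

```python
from itertools import islice


def pes2o_load_kwargs(version="v2", split="train", streaming=True):
    """Build keyword arguments for datasets.load_dataset.

    Hypothetical helper: `version` is passed as the dataset configuration
    name, mirroring the "v2" argument used in the README snippet.
    """
    return {
        "path": "allenai/peS2o",
        "name": version,
        "split": split,
        "streaming": streaming,
    }


def preview(n=3, **overrides):
    """Return the first `n` records of the selected version as a list."""
    # Imported lazily so pes2o_load_kwargs works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(**pes2o_load_kwargs(**overrides))
    # With streaming=True (assumed to be supported here), records are
    # fetched on demand instead of downloading the full split first.
    return list(islice(dataset, n))
```

Calling `preview(n=3)` needs network access. Passing `version=...` selects another release, provided that configuration name exists on the Hub.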
@@ -20,7 +27,7 @@ If you use this dataset, please cite:
year = 2023,
title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},
institution = {{Allen Institute for AI}},
note = {\url{https://huggingface.co/datasets/allenai/pes2o}}
note = {ODC-By, \url{https://github.com/allenai/pes2o}}
}
```

Binary file added res/hf-logo.png
Binary file added res/logo.png