Add examples of advanced SYNERGY use #96

Merged
merged 40 commits into from
Apr 24, 2023
0a89e4a
Add new README to branch
J535D165 Mar 22, 2023
0b25352
Update README.md
J535D165 Mar 22, 2023
41dfbd8
Add files via upload
J535D165 Mar 22, 2023
f81c8fa
Add image
J535D165 Mar 22, 2023
9ff3262
Update README.md
J535D165 Mar 22, 2023
92e25d3
Add Kwok dataset
J535D165 Mar 22, 2023
bdd4ac8
Update README.md
J535D165 Mar 22, 2023
8033a6c
Update README.md
J535D165 Mar 22, 2023
a6a146e
Update README.md
J535D165 Mar 22, 2023
7f45467
Update README.md
J535D165 Mar 22, 2023
abb1eb0
Update README.md
J535D165 Mar 22, 2023
b4046a4
Add link to web.archive.org
J535D165 Apr 1, 2023
9bd3dc9
Create ATTRIBUTION.md
J535D165 Apr 2, 2023
e44ac02
Update README.md
J535D165 Apr 2, 2023
c60ea72
Update ATTRIBUTION.md
J535D165 Apr 2, 2023
ef31c70
Update LICENSE
J535D165 Apr 2, 2023
7ee4b0c
Update broken links in ATTRIBUTION.md
J535D165 Apr 4, 2023
7d48dfc
Merge branch 'master' into README
J535D165 Apr 4, 2023
ca0a175
Update numbers in README.md
J535D165 Apr 10, 2023
6694da4
Fix wrong percentage
J535D165 Apr 10, 2023
4342069
Add examples on Python package
J535D165 Apr 15, 2023
51dd9d8
Update README.md
J535D165 Apr 15, 2023
d8de592
Update README.md
J535D165 Apr 15, 2023
b37200f
Update attribution
J535D165 Apr 15, 2023
412eba5
Update ATTRIBUTION.md
J535D165 Apr 15, 2023
5ba3470
Update ATTRIBUTION.md
J535D165 Apr 15, 2023
b8ec22d
Update README.md
J535D165 Apr 15, 2023
dd95287
Update README.md
J535D165 Apr 15, 2023
f42e058
Add LICENSE info
J535D165 Apr 16, 2023
bce3a20
Update license text
J535D165 Apr 16, 2023
10fe4a0
Update codebook
J535D165 Apr 16, 2023
202c2ee
Merge branch 'README' into examples
J535D165 Apr 16, 2023
b7eaad3
Merge branch 'master' into README
J535D165 Apr 16, 2023
ad598cc
Merge branch 'README' into examples
J535D165 Apr 16, 2023
cab2654
Add notebook on concepts in SYNERGY
J535D165 Apr 17, 2023
da3c1d4
Add more API examples
J535D165 Apr 24, 2023
5519fe1
Remove changes to readme
J535D165 Apr 24, 2023
4bc5697
Delete ATTRIBUTION.md
J535D165 Apr 24, 2023
dd337f8
Add Attribution
J535D165 Apr 24, 2023
9cd3fb6
Merge branch 'master' into examples
J535D165 Apr 24, 2023
Remove changes to readme
J535D165 committed Apr 24, 2023
commit 5519fe1b7c66c7e46cbbc79e20dcdfca1b6d0b99
93 changes: 40 additions & 53 deletions README.md
@@ -1,59 +1,57 @@

# :exclamation: This is work in progress, please do NOT use. Public release under open license will follow soon. Questions? Contact [email protected].


# SYNERGY dataset

SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset an interesting dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, information retrieval, network analysis and more. In total, the dataset contains 82,668,134 trainable data points.
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points.

[![SYNERGY-banner.png](SYNERGY-banner.png)]()

## Get the data

The easiest way to get the SYNERGY dataset is via the `synergy-dataset` Python package.
The easiest way to get the SYNERGY dataset is via the `synergy-dataset` Python package. Install the package with:

```bash
pip install --pre synergy-dataset
pip install synergy-dataset
```

To download and build the SYNERGY dataset, run the following command in the command line:

```bash
python -m synergy_dataset get
```

You can get an overview of the datasets and their properties with `synergy_dataset list` and `synergy_dataset show <DATASET_NAME>`.
To get an overview of the datasets and their properties, use `synergy_dataset list` and `synergy_dataset show <DATASET_NAME>`.

## Datasets and variables

SYNERGY is a collection of 24 systematic review datasets with in total 169,288 records with 2834 total inclusions. The list of datasets and references:

| Nr | Dataset | Topic(s) | Records | Included | % |
|------|-------------------------|-------------------------------|-----------|------------|------|
| 1 | Appenzeller-Herzog_2019 | Medicine | 2873 | 26 | 0.9 |
| 2 | Bos_2018 | Medicine | 4878 | 10 | 0.2 |
| 3 | Brouwer_2019 | Psychology, Medicine | 38114 | 62 | 0.2 |
| 4 | Chou_2003 | Medicine | 1908 | 15 | 0.8 |
| 5 | Chou_2004 | Medicine | 1630 | 9 | 0.6 |
| 6 | Donners_2021 | Medicine | 258 | 15 | 5.8 |
| 7 | Hall_2012 | Computer science, Engineering | 8793 | 104 | 1.2 |
| 8 | Jeyaraman_2020 | Medicine | 1175 | 96 | 8.2 |
| 9 | Leenaars_2019 | Medicine, Chemistry | 5812 | 17 | 0.3 |
| 10 | Leenaars_2020 | Medicine, Biology | 7216 | 583 | 8.1 |
| 11 | Meijboom_2021 | Medicine, Physics | 882 | 37 | 4.2 |
| 12 | Menon_2022 | Medicine, Psychology | 975 | 74 | 7.6 |
| 13 | Moran_2021 | Psychology, Biology | 5214 | 111 | 2.1 |
| 14 | Muthu_2021 | Medicine, Chemistry | 2719 | 336 | 12.4 |
| 15 | Nelson_2002 | Medicine, Physics | 366 | 80 | 21.9 |
| 16 | Oud_2018 | Psychology, Medicine | 952 | 20 | 2.1 |
| 17 | Radjenovic_2013 | Computer science, Engineering | 5935 | 48 | 0.8 |
| 18 | Sep_2021 | Computer science, Psychology | 271 | 40 | 14.8 |
| 19 | Smid_2020 | Computer science, Mathematics | 2627 | 27 | 1 |
| 20 | van_de_Schoot_2018 | Computer science, Mathematics | 4544 | 38 | 0.8 |
| 21 | Valk_2021 | Medicine, Mathematics | 725 | 89 | 12.3 |
| 22 | van_der_Waal_2022 | Medicine, Political science | 1970 | 33 | 1.7 |
| 23 | van_Dis_2020 | Psychology, Medicine | 9128 | 72 | 0.8 |
| 24 | Walker_2018 | Psychology, Medicine | 48375 | 762 | 1.6 |
| 25 | Wassenaar_2017 | Medicine, Biology | 7668 | 111 | 1.4 |
| 26 | Wolters_2018 | Medicine | 4280 | 19 | 0.4 |
The SYNERGY dataset comprises the study selection of 26 systematic reviews. The dataset contains 169,288 records, of which 2,834 are manually labeled as included by the authors of the systematic reviews. The list of systematic reviews and their basic properties:

| Nr | Dataset | Topic(s) | Records | Included | % |
|------|-------------------------|---------------------------------|-----------|------------|------|
| 1 | Appenzeller-Herzog_2019 | Medicine | 2873 | 26 | 0.9 |
| 2 | Bos_2018 | Medicine | 4878 | 10 | 0.2 |
| 3 | Brouwer_2019 | Psychology, Medicine | 38114 | 62 | 0.2 |
| 4 | Chou_2003 | Medicine | 1908 | 15 | 0.8 |
| 5 | Chou_2004 | Medicine | 1630 | 9 | 0.6 |
| 6 | Donners_2021 | Medicine | 258 | 15 | 5.8 |
| 7 | Hall_2012 | Computer science | 8793 | 104 | 1.2 |
| 8 | Jeyaraman_2020 | Medicine | 1175 | 96 | 8.2 |
| 9 | Leenaars_2019 | Psychology, Chemistry, Medicine | 5812 | 17 | 0.3 |
| 10 | Leenaars_2020 | Medicine | 7216 | 583 | 8.1 |
| 11 | Meijboom_2021 | Medicine | 882 | 37 | 4.2 |
| 12 | Menon_2022 | Medicine | 975 | 74 | 7.6 |
| 13 | Moran_2021 | Biology, Medicine | 5214 | 111 | 2.1 |
| 14 | Muthu_2021 | Medicine | 2719 | 336 | 12.4 |
| 15 | Nelson_2002 | Medicine | 366 | 80 | 21.9 |
| 16 | Oud_2018 | Psychology, Medicine | 952 | 20 | 2.1 |
| 17 | Radjenovic_2013 | Computer science | 5935 | 48 | 0.8 |
| 18 | Sep_2021 | Psychology | 271 | 40 | 14.8 |
| 19 | Smid_2020 | Computer science, Mathematics | 2627 | 27 | 1 |
| 20 | van_de_Schoot_2018 | Psychology, Medicine | 4544 | 38 | 0.8 |
| 21 | Valk_2021 | Medicine, Psychology | 725 | 89 | 12.3 |
| 22 | van_der_Waal_2022 | Medicine | 1970 | 33 | 1.7 |
| 23 | van_Dis_2020 | Psychology, Medicine | 9128 | 72 | 0.8 |
| 24 | Walker_2018 | Biology, Medicine | 48375 | 762 | 1.6 |
| 25 | Wassenaar_2017 | Medicine, Biology, Chemistry | 7668 | 111 | 1.4 |
| 26 | Wolters_2018 | Medicine | 4280 | 19 | 0.4 |
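
The percentage column follows directly from the record and inclusion counts. A minimal sketch that recomputes it for a few rows copied from the table above; the `prevalence` helper is a hypothetical name used for illustration and is not part of the `synergy-dataset` package:

```python
# Rows copied from the table above: (dataset, records, included).
ROWS = [
    ("Appenzeller-Herzog_2019", 2873, 26),
    ("Bos_2018", 4878, 10),
    ("Nelson_2002", 366, 80),
    ("Walker_2018", 48375, 762),
]


def prevalence(records: int, included: int) -> float:
    """Return the share of included records as a percentage, rounded to one decimal."""
    return round(100 * included / records, 1)


for name, records, included in ROWS:
    print(f"{name}: {prevalence(records, included)}%")

# Totals over the full dataset, as stated in the README text.
TOTAL_RECORDS = 169_288
TOTAL_INCLUDED = 2_834
overall = round(100 * TOTAL_INCLUDED / TOTAL_RECORDS, 2)
print(f"SYNERGY overall: {overall}%")
```

The low overall share is what makes the labels sparse: any model trained on these records has to cope with a strong class imbalance.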

Each record in the dataset is an [OpenAlex Work object](https://docs.openalex.org/api-entities/works/work-object) (copy at [web.archive.org](https://web.archive.org/web/20230331020326/https://docs.openalex.org/api-entities/works/work-object), extracted on 2023-03-31) with the following attributes:
@@ -71,7 +69,7 @@ Some of the notable variables are:
| type | String | The type or genre of the work as defined by https://api.crossref.org/types. |
| publication_year | Integer | The year this work was published. |
| referenced_works | List | List of OpenAlex IDs for works that this work cites. |
| concepts | List | List of wikidata concept objects. |
| concepts | List | List of wikidata concept objects (or topics). |
| best_oa_location | Object | An object with the best available open access location for this work. |
| cited_by_count | Integer | The number of citations to this work as of April 1st, 2023. |
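
To illustrate how these variables can be read, here is a minimal sketch that works on a plain dictionary shaped like an OpenAlex Work object. The record is invented for illustration: the IDs, counts, and concept scores are placeholders, not real SYNERGY data. The `display_name` and `score` keys inside each concept are assumed to follow the OpenAlex concept layout, and `summarize_work` is a hypothetical helper, not part of the `synergy-dataset` package.

```python
from typing import Any

# Hypothetical record: keys follow the variables listed above,
# values are placeholders and do not refer to real works.
work: dict[str, Any] = {
    "id": "https://openalex.org/W0000000001",
    "type": "journal-article",
    "publication_year": 2018,
    "referenced_works": [
        "https://openalex.org/W0000000002",
        "https://openalex.org/W0000000003",
    ],
    "concepts": [
        {"display_name": "Medicine", "level": 0, "score": 0.81},
        {"display_name": "Psychology", "level": 0, "score": 0.42},
    ],
    "cited_by_count": 12,
}


def summarize_work(record: dict[str, Any]) -> dict[str, Any]:
    """Collect a few fields from a Work-like dict, tolerating missing keys."""
    concepts = record.get("concepts") or []
    # Pick the concept with the highest score, if there is any concept at all.
    top = max(concepts, key=lambda c: c.get("score", 0.0), default=None)
    return {
        "year": record.get("publication_year"),
        "n_references": len(record.get("referenced_works") or []),
        "top_concept": top["display_name"] if top else None,
        "cited_by_count": record.get("cited_by_count", 0),
    }


print(summarize_work(work))
```

Reading fields with `.get` keeps the helper usable on records where a variable is absent, which can happen when metadata is incomplete.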

@@ -94,28 +92,17 @@ SYNERGY dataset is released under the [CC0 1.0](LICENSE) license. SYNERGY consis

## Citing SYNERGY dataset

If you use SYNERGY in a scientific publication, we would appreciate references to:

Biblatex entry:

```bib
@online{xxx,
author = {xxx},
title = {xxx},
date = {xxx},
year = {2023},
}
```
If you use SYNERGY in a scientific publication, we would appreciate references to this GitHub repository. A proper reference to this dataset will follow soon.

## Contributing

We welcome contributions of all kinds. Some examples are:

- Do you have an openly published systematic review dataset? Read about our ambition to develop SYNERGY+ (SYNERGY Plus), a much larger dataset with lots of new features.
- Do you have an openly published systematic review dataset? Read about our ambition to develop [SYNERGY+ (SYNERGY Plus)](https://github.com/asreview/synergy-dataset/discussions), a much larger dataset with lots of new features.
- Write an [example or tutorial](examples) on how to use SYNERGY and all of its hidden capabilities.
- Write an integration to load SYNERGY into existing software like spaCy, Gensim, TensorFlow, Docker.

## Contact

Reach out on the [Discussion forum](https://github.com/asreview/systematic-review-datasets/discussions).
Reach out on the [Discussion forum](https://github.com/asreview/synergy-dataset/discussions).
