Dataset pages

If could, note for what purpose dataset was created

Biomedical dataset overview

Bio2rdf

link The latest release of Bio2RDF contains around 11 billion triples which are part of 35 datasets. written here

statistics
This papser summarizes datasets, including
- KB
- Corpus for training
- Shared task
- etc...
Dataset overview (in A neural network multi-task learning approach to biomedical named entity recognition )
Another dataset may be written in Knowledge-driven Entity Recognition and Disambiguation in Biomedical Text

Corpus for biomedical domain

here

Ontologies and Knowledge base

Chemical Entities of Biological Interest :ChEBI ontology* contains more than 50,000 entites, written in Analyzing user interactions with biomedical ontologies: A visual perspective
Online Mendelian Inheritance in Man (OMIM) at OMIM.org is the primary repository of comprehensive, curated information on genes and genetic phenotypes and the relationships between them. For more details, ,for example, see this paper
CentoMD, a comprehensive and global genotype–phenotype database (Centogene's Mutation Database) that focuses primarily on human rare diseases.paper
CTD, KEGG, OMIM, BioGRID, Omnipath, and ChEMBL are refferd in this paper
the Cell Type Ontology (CL), the Chemical Entities of Biological Interest ontology (CHEBI), the NCBI Taxonomy (NCBITaxon), the Protein Ontology (PR), the Sequence Ontology (SO), the Entrez Gene database (EG), and the Gene Ontology (GO) are used for EL task in this paper
CAS registry numbers, PubChem, ChEBI and CHEMBL are writtten here
PDD graph
- The gap between clinical data and biomedical knowledge graphs. link

Entity linking

The Colorado Richly Annotated Full-Text (CRAFT) corpus (Bada et al., 2012) is the largest gold standard corpus with high-quality annotations from multiple] KBs. For more details, see this paper
- The expanded one is [here](Gold-standard ontology-based anatomical annotation in the CRAFT Corpus)
  - we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics.
CoNLL-YAGO (Hoffart et al., 2011): the CoNLL 2003 shared task including testa of 4791 mentions in 216 documents, and testb of 4485 mentions in 213 documents.For more details, for example, see this paper.
TAC2010 (Ji et al., 2010): constructed for the Text Analysis Conference that comprises 676 mentions in 352 documents for testin]g.
- In this paper, they said that this dataset is a 'easy' case since it has only two mentions per document on avarage and high prior probabilities of correct candidates.
ACE2004 (Ratinov et al., 2011): a subset of ACE2004 co-reference documents including 248 mentions in 35 documents, which is annotated by Amazon Mechanical Turk.
AQUAINT (Milne and Witten, 2008): 50 news articles including 699 mentions from three different news agencies.
WW (Guo and Barbosa, 2017): a new benchmark with balanced prior distributions of mentions, leading to a hard case of disambiguation. It has 6374 mentions in 310 documents automatically extracted from Wikipedia.
- In this paper, they said that this dataset is a 'hard' case since it has the most mentions with balanced prior probabilities.
AIDA is created by (Hoffart et al., 2011), which includes 1,393 docu- ments and 27,816 mentions referring to Wikipedia entries.For more detai, see this paper
CoNLL The CoNLL dataset is a popular NED dataset constructed by Hoffart et al.(Hoffart et al., 2011). The dataset is based on NER data from the CoNLL 2003 shared task, and consists of training, development, and test sets, containing 946, 216, and 231 documents,respectively.For more detail, see here
MedMentions is a corpus of 246K mentions from PubMed abstracts linked to the UMLS knowledge base. For more details, see here. See also BC5-CDR, NCBI-disease, BC-2GN Train, NLM-Citation GIA datasets, linked to UMLS ontologies.
- NCBI focuses on mainly disease entries.
  
  NCBI downloadable dataset
- BC5-CDR(BCV-CDR) contains both chemicals and diseases. (BC5-CDR dataset is here? Still suspecting.)
- BC2-GN and and NLM datasets both contain genes.
Typenet aligns Freebase types with the WordNet hierarchy to obtain nearly 2k entity types. For more details, see here.
- statistics
Microposts2014 / 2015 NEEL The 2014 Microposts dataset (Cano et al., 2014) consists of 3,504 tweets extracted from a much larger collection of over 18 million tweets. The tweets were collected over one month in 2011. The 2014 Microposts challenge dataset was created to benchmark automatic extraction and linking entities.
AIDA-YAGO2, NEEL2014, NEEL2015, OKE2015, RSS500, WES2015, Wikinews 's details are here:Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job
Survey on Challenges of Question Answering in the Semantic Web describes some models and datasets, combined with QA task.
The PHADERA corpus is a semantically annotated corpus for pharmacovigilence (PV), consisting of 597 MEDLINE abstracts. Its fine-grained, multiple levels of annotation, added by domain-experts, make it a unique resource within the field, and aim to encourage the development/adaption of novel machine learning tools for extracting PV-related information from text. Details are here
Gnormpluscorpus Gene/protein annotated corpus for Entity linking. Connected ontology is Entrez gene paper
The BioC-BioGRID corpus was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). corpus link
- Linked gene ontology is Entrez gene

Ontology Mappings

UniProtKB

NER.

BC5CDR is from the most recent BioCreative V Chemical and Disease Mention Recognition task. It has 1,500 articles containing 15,935 Chemical and 12,852 Disease mentions.
- Mention span and ID for KB is also annotated, so this dataset can be used for that of Entity Linking tasks.
NCBI-disease focuses on Disease Name Recognition. It contains 793 abstracts and 6,881 Disease mentions.
LaptopReview is from the SemEval 2014 Challenge, Task 4 Subtask 1 (Pontiki et al., 2014) focusing on laptop aspect term (e.g., “disk drive”) Recognition. It consists of 3,845 review sentences and 3,012 AspectTerm mentions.
WikilinksBED dataset is a large-scale NED dataset of text fragments from the web, which is significantly noisier and more challenging than existing news-based datasets. Research in which this dataset is used is. for example, here
CADEC consists of 1,250 posts from the medical forum AskaPatient. These posts were manually annotated by medical experts and a clinical ter- minologist for drugs, ADEs, diseases, symptoms, and findings. Among all 9,111 annotated entities, there are 1,800 drug entities and 6,318 ADE entities.
see Medication and Adverse Event Extraction from Noisy Text
i2b2 corpra dataset focuses on the identification of medica- tions and medication-related information, such as their dosages, modes of administration, frequen- cies, durations, and reasons for administration, in discharge summaries. see Medication and Adverse Event Extraction from Noisy Text, or other papers.
GENETAG is a tagged corpus for gene/protein named entity recognition. See this paper

Knowledge base/graph

In Entity Linking: An Issue to Extract Corresponding Entity With Knowledge Base(2018), KB and EL datasets, beloging mainly to general domain, is described.
Crosswikis : a dictionary computed from a Google crawl of the web that stores the frequency with which a mention links to a particular entity. see here
SNOMED-CT is a medical knowledge base of 327,000 concepts. For example of use, see Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs
reversed WordNet task 's goal is to associate a multi-word definition to the correct lemma. For example of use, see Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs
UMLS has terminologies related to medical concepts, and, for example, this research uses its MRCONSO table as terminology vocabulaly. Its MRCONSO table has 3,5million entities, represented by 7.6 million strings in total.
- link でも詳細なUMLSについての言及がある。
Drugbank database server serves the DrugBank data in a structured format according to the Linked Data principles.

Entity typing

WIKI and ONTONOTES are well-establised and widely-used entity-typing dataset.

Some NER corpus can be used for Entity typing task, since they have at least typing annotation.

Text classification

Cora dataset is for document classification. For example of use, see Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs

Abbreviation

Abbrev dataset 3 developed by Stevenson, et al. (Stevenson et al., 2009) contains examples of 300 ambiguous abbreviations found in MEDLINE that were initially presented by (Liu et al., 2001). For more detail, see here.

Word Sense Disambiguation / Word Similarity

MSH-WSD dataset 5 developed by (Jimeno- Yepes et al., 2011) contains 203 ambiguous terms and abbreviations from the 2010 MEDLINE base- line. For more detail, see here.
UMNSRS(Pakhomov et al., 2010) datasets consists of pairs of UMLS concepts with judgments of similarity(566 pairs) and relatedness(587 pairs), as assignd by medical experts. For example of use, see here(ACL2018) and here(ACL2018)
WikiSRS is a novel dataset of similarity and relatedness judgments of paired Wikipedia entities, created by them

Corpus

Medline is a bibliographic database containing around 23 million ci- tations to journal articles in the biomedical do- main and is maintained by National Library of Medicine. The 2015 Medline Baseline encom- passes approximately 5,600 journals starting from 1948, and contains 22,775,609 citations, of which 13,835,206 contain abstracts.For more detail, see, for example, here.

Pretrained-Embeddings(biology)

Biomedical word2vec embeddings:issue,issue
Bioasq pretrained data.link
Cambridge Bio word2vec data link
- This paper use this.
BioBert contains wordpiece embeddings of bio-vocabularies.
- For use, you have to preprocess target train/test corpus in a word-pieced manner.

Mappings

Since there are so many ontologies, mappings is needed.

uniprot
iproclass

Gene

Entrez Gene link

included data:

#tax_id GeneID  Symbol  LocusTag        Synonyms        dbXrefs chromosome      map_location    description     type_of_gene    Symbol_from_nomenclature_authority      Full_name_from_nomenclature_authority   Nomenclature_status     Other_designations      Modification_date       Feature_type

under surveying

Drugbank, Bio2RDF, OMIM, NCBI-Gene, KEGG, PharmGKB, WCC, InterPro. iProClass, MEDIC Freebase's entity description

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
img		img
On-going_and_to-go_Shared_Task.md		On-going_and_to-go_Shared_Task.md
README.MD		README.MD
dataset_home.md		dataset_home.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset pages

Biomedical dataset overview

Corpus for biomedical domain

Ontologies and Knowledge base

Entity linking

Ontology Mappings

NER.

Knowledge base/graph

Entity typing

Text classification

Abbreviation

Word Sense Disambiguation / Word Similarity

Corpus

Pretrained-Embeddings(biology)

Mappings

Gene

under surveying

About

Releases

Packages

izuna385/datasets

Folders and files

Latest commit

History

Repository files navigation

Dataset pages

Biomedical dataset overview

Corpus for biomedical domain

Ontologies and Knowledge base

Entity linking

Ontology Mappings

NER.

Knowledge base/graph

Entity typing

Text classification

Abbreviation

Word Sense Disambiguation / Word Similarity

Corpus

Pretrained-Embeddings(biology)

Mappings

Gene

under surveying

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages