
dataset_mart

  • SciFact

Fact or Fiction: Verifying Scientific Claims. [description] SciFact is a dataset of 1.4K expert-written scientific claims, each paired with evidence-containing abstracts annotated with labels and rationales. paper source

  • Health fact checking (PUBHEALTH)

Explainable Automated Fact-Checking for Public Health Claims. [description] PUBHEALTH, a dataset of 11.8K claims accompanied by journalist-crafted, gold-standard explanations (i.e., judgments) that support the fact-check labels. The dataset supports two tasks: veracity prediction and explanation generation. paper source

  • SLAKE

SLAKE: A SEMANTICALLY-LABELED KNOWLEDGE-ENHANCED DATASET FOR MEDICAL VISUAL QUESTION ANSWERING. [description] SLAKE, a large bilingual dataset with comprehensive semantic labels annotated by experienced physicians and a new structural medical knowledge base for Med-VQA. In addition, SLAKE includes richer modalities and covers more human body parts than currently available datasets. paper source

  • SCIREX 438

SCIREX: A Challenge Dataset for Document-Level Information Extraction paper source

[domain]CS

[description] Named entity types include Material, Method, Metric, and Task.

  • SCIERC 500

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction paper source

[domain]CS

[description] Named entity types include Material, Method, Metric, Task, Generic, and OtherScientificTerm.

  • DocRED (human-annotated: 5,053; distantly supervised: 101,873)

DocRED: A Large-Scale Document-Level Relation Extraction Dataset paper source

[domain] general

[description]DocRED covers a variety of entity types, including person (18.5%), location (30.9%), organization (14.4%), time (15.8%) and number (5.1%). Includes 96 frequent relation types from Wikidata. A notable property of our dataset is that the relation types cover a broad range of categories, including relations relevant to science (33.3%), art (11.5%), time (8.3%), personal life (4.2%), etc.,
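To make the document-level annotation concrete, here is a minimal sketch of a DocRED-style record. The field names (`sents`, `vertexSet`, `labels`) follow the released JSON format, but the exact schema shown here should be treated as an illustrative assumption, and the example document is made up.

```python
# One document: sentences as token lists, entities as mention clusters,
# and document-level relation facts between entity indices.
doc = {
    "title": "Example",
    "sents": [["Marie", "Curie", "was", "born", "in", "Warsaw", "."]],
    "vertexSet": [  # one entry per entity; each entry is a list of mentions
        [{"name": "Marie Curie", "sent_id": 0, "pos": [0, 2], "type": "PER"}],
        [{"name": "Warsaw", "sent_id": 0, "pos": [5, 6], "type": "LOC"}],
    ],
    "labels": [  # relation facts: head entity, tail entity, Wikidata relation
        {"h": 0, "t": 1, "r": "P19", "evidence": [0]},  # P19 = place of birth
    ],
}

def entity_types(d):
    """Collect the type of each entity, taken from its first mention."""
    return [mentions[0]["type"] for mentions in d["vertexSet"]]

print(entity_types(doc))      # ['PER', 'LOC']
print(len(doc["labels"]))     # 1 annotated relation fact
```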

  • NLPContributionGraph

NLPContributions: An Annotation Scheme for Machine Reading of Scholarly Contributions in Natural Language Processing Literature paper source github

[domain] NLP

[description] The pilot annotation exercise was performed on 50 NLP-ML scholarly articles presenting contributions to five NLP tasks: 1. machine translation, 2. named entity recognition, 3. question answering, 4. relation classification, and 5. text classification.

  • S2ORC (PDF-parse: 8.1M; LaTeX-parse: 1.5M)

S2ORC: The Semantic Scholar Open Research Corpus paper source

[domain] The PDF parse is multi-domain; the LaTeX parse covers the physics, math, and CS domains.

[description]metadata, full text, inline citations and references, and bibliography entries

  • ELSEVIER OA CC-BY CORPUS 2000

ELSEVIER OA CC-BY CORPUS paper source github

[domain] 2,000 documents were sampled from each of the 27 top-level subject areas.

[description] body, abstract, metadata

  • CORD-19

CORD-19: The COVID-19 Open Research Dataset paper source github

  • BC5CDR

BioCreative V CDR task corpus: a resource for chemical disease relation extraction paper source

[domain] biomedical; tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction

[description]The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.

workshop

  • 3C - Citation Context Classification. Subtask A: identifying the purpose of a citation; multiclass classification of citations into one of six classes: Background, Uses, Compare_Contrast, Motivation, Extension, and Future. kaggle2020 Subtask B: identifying the importance of a citation; binary classification of citations into one of two classes: Incidental and Influential. kaggle2020

  • LongSumm - Long Summaries for Scientific Documents focuses on generating long summaries for scientific documents. source AILeaderboard

  • SCIVER - Verifying Scientific Claims with Evidence github source

  • TDMS - A Specialized Corpus for Scientific Literature Entity Tagging of Tasks, Datasets and Metrics: annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers. See also Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction, with 400 papers. github


  • AnchorNER

Towards Open-Domain Named Entity Recognition via Neural Correction Models paper source github

[domain] open-domain, general

[description]We apply the pipeline described hereafter to a dump of abstracts of English Wikipedia from 2017 and obtain AnchorNER. This dataset is built out of 5.2M abstracts of Wikipedia articles, consisting of 268M tokens accounting for 12M sentences.

  • BBN

BBN Pronoun Coreference and Entity Type Corpus source

[description] Annotates 2,311 Wall Street Journal articles in Treebank-2 (LDC95T7) with fine-grained entity types.

  • OntoNotes

OntoNotes: A Large Training Corpus for Enhanced Processing paper source

[description] The OntoNotes corpus is annotated with a three-layer set of 87 entity types.

  • FIGER

Fine-grained entity recognition paper github source

[description] contains 2.7 million automatically labeled training instances from Wikipedia and 434 manually annotated sentences from news reports

  • KNET

Improving Neural Fine-Grained Entity Typing with Knowledge Attention paper github

[description] It consists of an automatically annotated subset (WIKI-AUTO) and a manually annotated test set (WIKI-MAN).

  • Open-type

Ultra-Fine Entity Typing paper source github

[description] To capture multiple domains, we sample sentences from Gigaword (Parker et al., 2011), OntoNotes (Hovy et al., 2006), and web articles (Singh et al., 2012). We select entity mentions by taking maximal noun phrases from a constituency parser (Manning et al., 2014) and mentions from a coreference resolution system (Lee et al., 2017). 6,000 examples; on average each example has 5 labels: 0.9 general, 0.6 fine-grained, and 3.9 ultra-fine. • 9 general types: person, group, organization, location, entity, time, object, event, place • 121 fine-grained types • 10,201 ultra-fine types
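A toy illustration of the three label granularities and how per-example averages like those reported above are computed; the two example records below are made up, not drawn from the dataset.

```python
# Each example carries labels at three granularities; the averages are
# simply the mean label count per example for each granularity.
examples = [
    {"mention": "the senator", "general": ["person"], "fine": ["politician"],
     "ultra_fine": ["lawmaker", "legislator", "official"]},
    {"mention": "the bridge", "general": ["object", "location"], "fine": [],
     "ultra_fine": ["structure", "crossing"]},
]

def avg_labels(exs, key):
    """Mean number of labels of the given granularity per example."""
    return sum(len(e[key]) for e in exs) / len(exs)

print(avg_labels(examples, "general"))     # 1.5
print(avg_labels(examples, "ultra_fine"))  # 2.5
```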

  • CFET, a Chinese fine-grained entity typing dataset

A Chinese Corpus for Fine-grained Entity Typing paper github source

[description] We gather our entity mentions from four different sources: Golden Horse (He and Sun, 2016), the Boson dataset provided by BosonNLP, MSRA's open-source NER dataset, and PKU's Corpus of Multi-level Processing for Modern Chinese (Yu et al., 2018). For the MSRA and PKU datasets, the sentences are mostly extracted from news or magazines, and are thus more formal and detailed. For the Golden Horse dataset, most sentences are extracted from Weibo (a Chinese social media website similar to Twitter) posts, which are far more informal. We extract mentions from these sources and amass around 4,800 entity mentions with context sentences. 80% of the mentions are named entities (e.g. 香港/Hong Kong, 苹果公司/Apple Inc., 勒布朗-詹姆斯/LeBron James) and 20% of them are pronouns.

  • DialogRE

Dialogue-Based Relation Extraction paper github source

[domain] conversational (dialogues from the TV show Friends)

[description] 2,100 triples whose two arguments are in "no relation" are kept, for a final total of 10,168 triples over 1,788 dialogues. They are randomly split at the dialogue level, with 60% for training, 20% for development, and 20% for testing. Based on the predefined SF and DialogRE relation types, a subject is expected to be an entity of type PER, ORG, or geo-political entity (GPE). Notably, subjects of most relational triples in DialogRE are person names (96.8% vs. 69.7% in the SF dataset). The coarse-grained object type is entity, string, or value (i.e., a numerical value or a date). Related datasets: DREAM, C3.
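The dialogue-level 60/20/20 split described above can be sketched as follows; the split function and seed are illustrative assumptions, not the authors' released split.

```python
import random

def split_dialogues(dialogue_ids, seed=0):
    """Shuffle whole dialogues, then cut 60% train / 20% dev / 20% test.

    Splitting at the dialogue level keeps all triples of one dialogue
    in the same partition, so no conversation leaks across splits.
    """
    ids = list(dialogue_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(n * 0.6), int(n * 0.2)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

train, dev, test = split_dialogues(range(1788))
print(len(train), len(dev), len(test))  # 1072 357 359
```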

  • DREAM

paper github source

[description]DREAM is a multiple-choice Dialogue-based REAding comprehension exaMination dataset. In contrast to existing reading comprehension datasets, DREAM is the first to focus on in-depth multi-turn multi-party dialogue understanding.

  • C3

paper github source

[description]C3 is the first free-form multiple-Choice Chinese machine reading Comprehension dataset, containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second language examinations.

  • MAVEN (unreleased)

MAVEN: A Massive General Domain Event Detection Dataset paper

Relation-only datasets

https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets

https://lipn.univ-paris13.fr/~gabor/semeval2018task7/

About

A collection of scientific literature datasets.