ICPSR/dataset-references

dataset-references

The code in this repository trains and applies a Named Entity Recognition (NER) model that detects informal references to datasets in academic literature. The labeled data are derived from the ICPSR Bibliography of Data-Related Literature and the Semantic Scholar Open Research Corpus. This analysis supports the paper "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature."

code/ner-demo.ipynb

Demonstration notebook that applies the NER model to a paper
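
As a rough sketch of what applying the model involves (not the notebook's actual code), the helper below collects DATASET entities from a spaCy-style pipeline. The function name `extract_dataset_mentions` is hypothetical. The only assumption about `nlp` is that calling it on a string returns an object whose `ents` expose `text`, `label_`, `start_char`, and `end_char`, as spaCy `Doc` objects do.

```python
from typing import Callable, List, Tuple


def extract_dataset_mentions(
    nlp: Callable, text: str, label: str = "DATASET"
) -> List[Tuple[str, int, int]]:
    """Return (mention, start_char, end_char) for each entity with `label`.

    `nlp` is assumed to behave like a loaded spaCy pipeline: calling it on
    a string yields a document with an `ents` sequence.
    """
    doc = nlp(text)
    return [
        (ent.text, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ == label
    ]
```

With spaCy installed and a trained pipeline saved on disk, this could be called as `extract_dataset_mentions(spacy.load("path/to/model"), paper_text)`, where the path is a placeholder rather than a location in this repository.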

code/spacy-ner.ipynb

Training workflow for spaCy NER model using labeled data

config.cfg

NER model training parameters

data/

Each dataset consists of sentences from academic articles and is named for the source from which it was derived. Training data were labeled, merged, and exported from Prodigy as of May 10, 2022, for use in spaCy with the following recipes:

  • prodigy db-in dataset_name /path/to/_data.jsonl
  • prodigy ner.manual dataset_name --label DATASET
  • prodigy data-to-spacy train --ner bibliography,paperpile,s2orc
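
To illustrate the kind of record these recipes operate on, the sketch below reads Prodigy-style JSONL, where each line is a JSON object with a `text` field and a list of `spans` carrying character offsets and a label. The sample sentence, the helper name `read_dataset_spans`, and the choice to keep only accepted answers are assumptions for illustration, not taken from the repository's data.

```python
import io
import json
from typing import Iterable, Iterator, List, Tuple


def read_dataset_spans(
    lines: Iterable[str], label: str = "DATASET"
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (sentence, [mention, ...]) for each accepted annotation line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # Assumption: records without an "answer" key are treated as accepted.
        if record.get("answer", "accept") != "accept":
            continue
        text = record["text"]
        mentions = [
            text[span["start"]:span["end"]]
            for span in record.get("spans", [])
            if span.get("label") == label
        ]
        yield text, mentions


# Hypothetical sample record; offsets are computed rather than hard-coded.
sentence = "Data come from the American Community Survey."
mention = "American Community Survey"
start = sentence.index(mention)
sample = io.StringIO(
    json.dumps(
        {
            "text": sentence,
            "spans": [{"start": start, "end": start + len(mention), "label": "DATASET"}],
            "answer": "accept",
        }
    )
    + "\n"
)

for text, mentions in read_dataset_spans(sample):
    print(mentions)
```

Passing an open file handle for a `_data.jsonl` file in place of `sample` would iterate over real annotations in the same way.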