Downloading and Preprocessing Datasets

Make sure you have installed the environment (instructions are in the parent folder). Then go to the directory named after the dataset you want to use and execute its run.sh script.

Example: download and preprocess the KPTimes dataset

$ cd kptimes
$ bash run.sh

Data Format

After the preprocessing script finishes, there will be three folders.

  • processed: One json file for each of the training, validation, and test sets, containing an entry with the keys id, title, abstract, present_kps, and absent_kps for each document.
  • fairseq: x.source and x.target files for x in train, valid, and test. Mainly used for validation and fine-tuning with fairseq. x.source contains the inputs, consisting of the title and abstract concatenated with a [sep] token. x.target contains the target keyphrases joined with ;.
  • json: The same content as fairseq, in json format. Used for fine-tuning the sequence generation models.
  • If you wish to run sequence labeling for keyphrase extraction, please follow the data preprocessing procedure in the sequence_tagging folder. After preprocessing, there will be a bioformat folder containing the data required for sequence tagging.
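
To make the relationship between the processed and fairseq formats concrete, here is a minimal Python sketch that builds one source line and one target line from a single processed entry. It is an illustration, not the repository's preprocessing code. The function names, the spaces around [sep], the order of present before absent keyphrases, and the assumption that present_kps and absent_kps are lists of strings are all hypothetical; check the run.sh output of your dataset for the exact conventions.

```python
# Illustrative only: the helper names and formatting details are assumptions,
# not taken from the repository's scripts.

SEP_TOKEN = "[sep]"  # separator between title and abstract, as described above
KP_DELIM = ";"       # delimiter between target keyphrases, as described above


def to_source_target(entry):
    """Build one (source, target) pair from a processed-style entry.

    Assumes present_kps and absent_kps are lists of strings and that
    present keyphrases come before absent ones in the target.
    """
    source = f"{entry['title']} {SEP_TOKEN} {entry['abstract']}"
    target = KP_DELIM.join(entry["present_kps"] + entry["absent_kps"])
    return source, target


def split_target(target):
    """Recover the list of keyphrases from a target line."""
    return [kp.strip() for kp in target.split(KP_DELIM) if kp.strip()]


# A made-up entry with the keys listed for the processed folder.
example = {
    "id": "doc-0",
    "title": "A study of keyphrase generation",
    "abstract": "We study keyphrase generation with neural models.",
    "present_kps": ["keyphrase generation", "neural models"],
    "absent_kps": ["sequence to sequence"],
}

source, target = to_source_target(example)
print(source)
print(target)
print(split_target(target))
```

Splitting the target on ; gives back the keyphrase list, which is convenient when scoring model predictions against the references.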

Keyphrase Generation/Extraction Datasets

KP20k, Inspec, NUS, SemEval, Krapivin

KPBiomed

KPTimes

StackEx

OpenKP

OAGK and OAGKX

LDKP

MSMARCO (Query Prediction from Clicked Documents)

Summarization Datasets

arxiv and pubmed