DeepKPG/data at main · uclanlp/DeepKPG

Downloading and Preprocessing Datasets

Make sure you have installed the environment (instructions are in the parent folder). Then go to the directory named after the dataset you want to use and execute the run.sh script.

Example: download and preprocess the KPTimes dataset

$ cd kptimes
$ bash run.sh

Data Format

After the preprocessing script finishes, there will be three folders:

processed: One json file containing an entry with the keys id, title, abstract, present_kps, and absent_kps for each document in the training, validation, and test datasets.
fairseq: x.source and x.target files for x in train, valid, and test. Mainly used for validation and fine-tuning with fairseq. x.source contains the inputs, each consisting of the title and abstract concatenated with a [sep] token. x.target contains the target keyphrases concatenated with ;.
json: the same content as fairseq, in json format. Used for fine-tuning the sequence generation models.
If you wish to run sequence labeling for keyphrase extraction, please follow the data preprocessing procedure in the sequence_tagging folder. After preprocessing, there will be a bioformat folder containing the data required for sequence tagging.
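
As a rough illustration of how the processed and fairseq formats relate, the sketch below builds one source line and one target line from a single entry, then splits the target line back into keyphrases. It is not the repository's preprocessing code. The function names, the sample entry, the spacing around [sep] and ;, and the choice to list present keyphrases before absent ones are all assumptions made for this example.

```python
SEP = " [sep] "   # assumption: separator token padded with single spaces
KP_DELIM = " ; "  # assumption: keyphrase delimiter padded with single spaces


def to_source_target(entry):
    """Build one (source, target) line pair from a processed-style entry."""
    source = entry["title"] + SEP + entry["abstract"]
    # Assumption: present keyphrases come first, followed by absent ones.
    target = KP_DELIM.join(entry["present_kps"] + entry["absent_kps"])
    return source, target


def parse_target(line):
    """Split a target line back into a list of keyphrases."""
    return [kp.strip() for kp in line.split(";") if kp.strip()]


# Hypothetical entry, invented for illustration only.
entry = {
    "id": "doc-0",
    "title": "a toy title",
    "abstract": "a toy abstract about keyphrase generation .",
    "present_kps": ["keyphrase generation"],
    "absent_kps": ["sequence to sequence"],
}

source, target = to_source_target(entry)
print(source)
print(target)
print(parse_target(target))
```

Splitting on a bare ; and stripping whitespace keeps the parser tolerant of either spacing convention in the generated files.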

Keyphrase Generation/Extraction Datasets

KP20k, Inspec, NUS, SemEval, Krapivin

KPBiomed

KPTimes

StackEx

OpenKP

OAGK and OAGKX

LDKP

MSMARCO (Query Prediction from Clicked Documents)

Summarization Datasets

arxiv and pubmed
