Make sure you have installed the environment (see the instructions in the parent folder). Then go to the directory named after the dataset you want to use and execute the run.sh script.
$ cd kptimes
$ bash run.sh
After the preprocessing script finishes, there will be three folders.
- processed: One JSON file containing an entry with the keys id, title, abstract, present_kps, and absent_kps for each document in the training, validation, and test datasets.
- fairseq: x.source and x.target files for x in train, valid, and test. Mainly used for validation and fine-tuning with fairseq. x.source contains the inputs, each consisting of the title and abstract concatenated with a [sep] token. x.target contains the target keyphrases concatenated with a semicolon (;).
- json: Same content as fairseq, in JSON format. Used for fine-tuning the sequence generation models.
- If you wish to run sequence labeling for keyphrase extraction, please follow the data preprocessing procedure in the sequence_tagging folder. After preprocessing, there will be a bioformat folder containing the data required for sequence tagging.
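The relationship between a processed entry and its fairseq source/target lines can be sketched in Python as below. This is a minimal illustration, not the code used by run.sh. The key names, the [sep] token, and the semicolon delimiter come from the description above. Everything else is an assumption: that present_kps and absent_kps are lists of strings, that present keyphrases precede absent ones in the target, the spacing around the separators, and the helper names entry_to_pair and split_target, which are hypothetical.

```python
from typing import Dict, List, Tuple

SEP_TOKEN = "[sep]"   # joins title and abstract, per the folder description
KP_DELIMITER = ";"    # joins target keyphrases, per the folder description


def entry_to_pair(entry: Dict) -> Tuple[str, str]:
    """Build one (source, target) line pair from a processed entry.

    Assumption: present_kps and absent_kps are lists of strings, and
    present keyphrases come first in the target. run.sh may differ.
    """
    source = f"{entry['title']} {SEP_TOKEN} {entry['abstract']}"
    keyphrases = list(entry["present_kps"]) + list(entry["absent_kps"])
    target = KP_DELIMITER.join(keyphrases)
    return source, target


def split_target(target_line: str) -> List[str]:
    """Recover the keyphrase list from one line of an x.target file."""
    return [kp.strip() for kp in target_line.split(KP_DELIMITER) if kp.strip()]


# Hypothetical entry, made up for illustration only.
example = {
    "id": "doc-0",
    "title": "A study of keyphrase generation",
    "abstract": "We study keyphrase generation with sequence models.",
    "present_kps": ["keyphrase generation", "sequence models"],
    "absent_kps": ["natural language processing"],
}

source_line, target_line = entry_to_pair(example)
print(source_line)
print(target_line)
```

Comparing the output of such a helper against the first lines of train.source and train.target is a quick way to check which ordering and spacing the preprocessing script actually produces.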
- Paper: https://www.aclweb.org/anthology/P17-1054/
- Download data from: https://drive.google.com/open?id=1DbXV1mZXm_o9bgfwPV9PV0ZPcNo1cnLp
- Paper: https://arxiv.org/abs/2211.12124
- Download data from: https://huggingface.co/datasets/taln-ls2n/kpbiomed
- Paper: https://www.aclweb.org/anthology/W19-8617/
- Download data from: https://github.com/ygorg/KPTimes
- Paper: https://www.aclweb.org/anthology/2020.acl-main.710/
- Download data from: https://github.com/memray/OpenNMT-kpg-release
- Paper: https://www.aclweb.org/anthology/D19-1521/
- Download data from: https://github.com/microsoft/OpenKP#download-the-dataset
- Paper: https://www.aclweb.org/anthology/N19-1070/
- Download data from: https://hdl.handle.net/11234/1-2943
- Paper: https://arxiv.org/pdf/2203.15349.pdf
- Download data from: https://huggingface.co/datasets/midas/ldkp3k and https://huggingface.co/datasets/midas/ldkp10k
- Paper: https://arxiv.org/abs/2006.05324
- Download data from: https://microsoft.github.io/TREC-2020-Deep-Learning/
- Paper: https://aclanthology.org/N18-2097/
- Download data from: https://github.com/armancohan/long-summarization