This repository contains the data and the implementation of the experiments from the paper *HQP: A Human-Annotated Dataset for Detecting Online Propaganda*.
In this work, we present HQP, a high-quality human-annotated dataset for detecting online propaganda. Our work additionally includes:
- Experiments on the performance of fully fine-tuning state-of-the-art pretrained language models for detecting online propaganda on our dataset.
- Experiments on the performance of state-of-the-art few-shot learning (prompt-based learning with LMBFF) on our dataset.
- An extension of prompt-based learning to the setting of multiple connected labels and to handle numerical features.
- Experiments on the performance of incorporating author features in both full fine-tuning and few-shot learning.
You can find more details of this work in our paper.
To run our code, please install all the dependency packages by using the following command:
pip install -r requirements.txt
Due to Twitter's privacy policy, we only share tweet IDs and labels in this repository, not the processed datasets. However, we can make the data available upon individual request ([email protected]).
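Once the tweets behind the shared IDs have been hydrated (e.g., via the Twitter API), they need to be joined back to the labels. The following is a minimal sketch of that join; the column names (`tweet_id`, `label`, `text`) are assumptions for illustration only, so check the files in Data/HiQualProp for the actual schema.

```python
# Sketch: attach hydrated tweet texts to the shared labels.
# Column names are assumptions; see Data/HiQualProp for the real schema.
import pandas as pd

def attach_texts(labels: pd.DataFrame, hydrated: pd.DataFrame) -> pd.DataFrame:
    """Inner-join labels with hydrated tweet texts on tweet_id,
    dropping tweets that could no longer be retrieved."""
    return labels.merge(hydrated, on="tweet_id", how="inner")

labels = pd.DataFrame({"tweet_id": [1, 2, 3], "label": [0, 1, 0]})
hydrated = pd.DataFrame({"tweet_id": [1, 3], "text": ["a", "b"]})  # tweet 2 deleted
df = attach_texts(labels, hydrated)
```

The inner join silently drops tweets that have since been deleted, which is the usual caveat when working with ID-only Twitter releases.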
We keep our annotated HQP dataset in Data/HiQualProp.
For few-shot learning, data samples are generated automatically in few-shot/data_splitted when the experiments are executed.
To replicate full fine-tuning as in our work, execute the following command:
python full/classification_trainer.py --multirun conf_hqp=hiqualprop,hiqualprop_mf,twe,tweetspin,weaklabels
This will fine-tune BERT-large, RoBERTa-large, and BERTweet-large (each 5 runs by default) on:
- our HQP dataset (hiqualprop)
- our HQP dataset while incorporating author features (hiqualprop_mf)
- the TWE dataset (twe)
- the replicated TWEETSPIN dataset (tweetspin)
- weak labels from our dataset (weaklabels)
Alternatively, each configuration can be run separately, e.g.:
python full/classification_trainer.py conf_hqp=hiqualprop
Performance evaluation and logging will be generated in the designated folder in full. The config files are in full/conf_hqp.
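As a rough illustration of what such a Hydra-style config might contain, consider the sketch below. All field names here are assumptions for illustration only; consult the actual files in full/conf_hqp for the real schema.

```yaml
# Illustrative only: field names are assumptions, not the repository's schema.
dataset: hiqualprop          # which dataset variant to fine-tune on
model_name: roberta-large    # e.g. BERT-large / RoBERTa-large / BERTweet-large
num_runs: 5                  # repetitions per configuration (5 by default)
use_author_features: false   # would be true for the *_mf variant
```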
To replicate prompt-based learning as in our work, execute the following command:
python few_shot/multiprompt_head_trainer.py --multirun conf_multiprompt=k16,k32,k64,k128
This will first perform prompt-based learning as in LMBFF (5 runs each) for the different sample sizes (16, 32, 64, 128) and for the two labels (BL and PSL). Then, the different classification heads (elastic net and neural net) are trained on the verbalizer probabilities (either alone or combined with different sets of author features) for the different sample sizes (again 5 runs each). Evaluation and logging for both LMBFF and the classification heads are generated in the designated folders in few_shot. The configs are in few_shot/conf_multiprompt.
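To make the second stage concrete, the following is a minimal sketch of an elastic-net classification head fitted on verbalizer probabilities concatenated with author features. The data here is synthetic and the feature layout and hyperparameters are assumptions; see few_shot/conf_multiprompt for the settings actually used.

```python
# Sketch: elastic-net head on verbalizer probabilities (+ author features).
# Synthetic data; feature layout and hyperparameters are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 64                                           # e.g. a k=64 few-shot split
verbalizer_probs = rng.random((n, 2))            # LM probabilities per label word
author_feats = rng.random((n, 3))                # hypothetical author features
X = np.hstack([verbalizer_probs, author_feats])  # probabilities alone also work
y = (verbalizer_probs[:, 1] > 0.5).astype(int)   # synthetic binary label

# Elastic-net penalty requires the saga solver in scikit-learn.
head = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=5000)
head.fit(X, y)
acc = head.score(X, y)
```

The elastic-net penalty (mixing L1 and L2 via `l1_ratio`) is a natural choice in the few-shot regime, since it shrinks uninformative author features toward zero.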
If you wish to only execute LMBFF training then execute the following command:
python few_shot/lmbff_trainer.py --multirun conf_lmbff=k16_bc,k32_bc,k64_bc,k128_bc,k16_mc,k32_mc,k64_mc,k128_mc
Here, _bc performs LMBFF for the binary propaganda label and _mc for the propaganda strategy label. The configs are in few_shot/conf_lmbff.
Again, both procedures can be executed separately for different sample sizes, e.g.:
python few_shot/lmbff_trainer.py conf_lmbff=k16_bc
Please address any issues regarding the code to Abdurahman Maarouf ([email protected]).
Please cite our paper if you use HQP in your work:
@article{maarouf2023hqp,
  title={HQP: A Human-Annotated Dataset for Detecting Online Propaganda},
  author={Maarouf, Abdurahman and B{\"a}r, Dominik and Geissler, Dominique and Feuerriegel, Stefan},
  journal={arXiv preprint},
  year={2023}
}