Skip to content

Companion code for the research paper "CoAug: Combining Augmentation of Labels and Labeling Rules"

License

Notifications You must be signed in to change notification settings

boschresearch/CoAug

Repository files navigation

CoAug

This is the companion code for the research paper "CoAug: Combining Augmentation of Labels and Labeling Rules". The code allows the users to reproduce and extend the results reported in the study. Please cite the above paper when reporting, reproducing or extending the results.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication "CoAug: Combining Augmentation of Labels and Labeling Rules". It will neither be maintained nor monitored in any way.

1. Setup

Setup environment by running source bin/init.sh. This will

  • Install and setup environment with correct dependencies.
  • Download the QuIP models for CoAug+QuIP experiments
  • We assume python and python venv is already installed in the system. The script has been verified to run with python 3.8.

2. Experiments in Paper

In this section, we introduce how to reimplement the experiments in our paper. We already include all needed datasets and rule files in this repo.

At the start of every experiment, please run bash bin/setup.sh to setup the right environment.

TaLLOR

python train_proto.py --dataset ${dataset} --encoder ${encoder} --mode tallor --seed ${seed} --rule_topk 20 --ap_threshold 0.75

where ${dataset} is one of bc5cdr/ncbi_disease/conll2003/wikigold ${encoder} is scibert (science-domain) or bert (general-domain)

ProtoBERT

python train_proto.py --dataset ${dataset} --encoder ${encoder} --mode proto --seed ${seed} --rule_topk 20 --ap_threshold 0.75

where ${dataset} is one of bc5cdr/ncbi_disease/conll2003/wikigold ${encoder} is scibert (science-domain) or bert (general-domain)

CoAug + ProtoBERT

python train_proto.py --dataset ${dataset} --encoder ${encoder} --mode coaug --seed ${seed} --rule_topk 20 --ap_threshold 0.75

where ${dataset} is one of bc5cdr/ncbi_disease/conll2003/wikigold ${encoder} is scibert (science-domain) or bert (general-domain)

QuIP

python train_quip.py --dataset ${dataset} --encoder ${encoder} --mode quip --seed ${seed} --rule_topk 20 --ap_threshold 0.75

where ${dataset} is one of bc5cdr/ncbi_disease/conll2003/wikigold ${encoder} is scibert (science-domain) or bert (general-domain)

CoAug + QuIP

python train_quip.py --dataset ${dataset} --encoder ${encoder} --mode coaug --seed ${seed} --rule_topk 20 --ap_threshold 0.75

where ${dataset} is one of bc5cdr/ncbi_disease/conll2003/wikigold ${encoder} is scibert (science-domain) or bert (general-domain)

The output will be in the experiment directory exp_out/{dataset_name}/{tallor/proto/quip/coaug}/{ProtoBERT/QuIP}/{seed}/{timestamp}/.

3. Dataset

4 datasets are preprocessed and included in this repository.

Dataset Task code Dir Source
BC5CDR bc5cdr data/bc5cdr link
NCBI Disease ncbi_disease data/ncbi_disease link
CoNLL 2003 conll2003 data/conll2003 link
Wiki Gold wikigold data/wikigold link

Dataset Prepration

We follow TALLOR's process to prepare the dataset, which uses AutoPhrase for preprocessing. For details, please refer to TALLOR's introduction.

4. Contact

For any doubts or questions regarding the work, please contact Rakesh ([email protected]). For any bug or issues with the code, feel free to open a GitHub issue or pull request.

5. Citation

@inproceedings{menon2023coaug,
    Author = {Menon, Rakesh R. and Wang, Bingqing and Araki, Jun and Zhou, Zhengyu and Feng, Zhe and Ren, Liu},
    Title = {{CoAug}: {C}ombining {A}ugmentation of {L}abels and {L}abeling {R}ules},
    booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
    month = jul,
    year= "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics"
}

License

CoAug is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in CoAug, see the file 3rd-party-licenses.txt.

About

Companion code for the research paper "CoAug: Combining Augmentation of Labels and Labeling Rules"

Topics

Resources

License

Stars

Watchers

Forks