Adapter-Bert Networks

Code for our NeurIPS 2020 paper "Incorporating BERT into Parallel Sequence Decoding with Adapters". Please cite our paper if you find this repository helpful in your research:

@article{guo2020incorporating,
  title={Incorporating BERT into Parallel Sequence Decoding with Adapters},
  author={Guo, Junliang and Zhang, Zhirui and Xu, Linli and Wei, Hao-Ran and Chen, Boxing and Chen, Enhong},
  journal={arXiv preprint arXiv:2010.06138},
  year={2020}
}

Requirements

The code is based on fairseq-0.6.2, PyTorch 1.2.0, and CUDA 9.2. The BERT implementation is heavily inspired by bert-nmt and Huggingface Transformers; many thanks to the authors for making their code available.
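
A minimal environment-setup sketch (not an official install script; it assumes the repository bundles its modified fairseq and can be installed in editable mode):

git clone https://github.com/lemmonation/abnet.git
cd abnet
pip install torch==1.2.0   # pick the build matching your CUDA version (9.2 in our experiments)
pip install --editable .   # installs the bundled, modified fairseq-0.6.2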

Instructions

Below are the instructions to reproduce our results on the IWSLT14 German-English translation task with mask-predict decoding.

Data Preprocessing

We tokenize and segment each word into wordpiece tokens using the same vocabulary as the pre-trained BERT models, following the implementation in Huggingface Transformers. We provide the wordpiece-tokenized IWSLT14 De-En dataset in this link.
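
For illustration, the expected wordpiece segmentation can be reproduced with the Huggingface tokenizers (a sketch using the current transformers API, which may differ from the tokenizer version bundled in this repository); the source side uses the German BERT vocabulary and the target side the English one:

from transformers import BertTokenizer

# German vocabulary for the source (de) side, English for the target (en) side.
src_tok = BertTokenizer.from_pretrained("bert-base-german-cased")
tgt_tok = BertTokenizer.from_pretrained("bert-base-uncased")

print(" ".join(src_tok.tokenize("Das ist ein Beispielsatz .")))
print(" ".join(tgt_tok.tokenize("this is an example sentence .")))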

Then preprocess the data as in fairseq, where $TEXT points to the directory containing the tokenized dataset and vocabulary files, and $DATA_DIR is the output directory for the binarized data:

python preprocess.py --task bert_xymasked_wp_seq2seq \
  --source-lang de --target-lang en \
  --srcdict $TEXT/count-bert-base-german-cased-vocab.txt \
  --tgtdict $TEXT/count-bert-base-uncased-vocab.txt \
  --trainpref $TEXT/train.wordpiece --validpref $TEXT/valid.wordpiece --testpref $TEXT/test.wordpiece \
  --destdir $DATA_DIR --workers 20

Train an Adapter-Bert Network

We provide an example of the training script:

python train.py $DATA_DIR \
  --task bert_xymasked_wp_seq2seq -s de -t en \
  -a transformer_nat_ymask_bert_two_adapter_deep_small \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --lr 0.0005 --min-lr '1e-09' \
  --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 \
  --weight-decay 0.0 --max-tokens 2000 --update-freq 2 --max-update 200000 \
  --left-pad-source False --adapter-dimension 512 \
  --use-adapter-bert --bert-model-name bert-base-german-cased --decoder-bert-model-name bert-base-uncased

We conduct our experiments on a single 12 GB Nvidia 1080Ti GPU, so we set --max-tokens to 2000 and --update-freq to 2 due to the limited GPU memory. On a GPU with more memory, you can set --max-tokens to 4096 and --update-freq to 1 to speed up training while keeping the effective batch size (max-tokens × update-freq) roughly unchanged.
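
For intuition, the trained adapter modules follow the standard residual bottleneck design, with --adapter-dimension controlling the bottleneck width; below is a generic PyTorch sketch of such a layer (an illustration of the common design, not the exact module implemented in this repository):

import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Down-project, apply a nonlinearity, up-project, and add a residual connection.
    # Only these parameters are trained; the surrounding pre-trained BERT layers stay frozen.
    def __init__(self, hidden_dim, adapter_dim):
        super().__init__()
        self.down = nn.Linear(hidden_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, hidden_dim)
        self.activation = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.activation(self.down(x)))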

Generate with Mask-Predict Decoding

We report the performance of the average of the last 10 checkpoints.
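
To obtain the averaged checkpoint checkpoint_aver.pt, the standard fairseq averaging utility can be used (a sketch; it assumes scripts/average_checkpoints.py from the bundled fairseq is available, and $CHECKPOINT_DIR is a placeholder for the directory where training saved its checkpoints):

python scripts/average_checkpoints.py \
  --inputs $CHECKPOINT_DIR \
  --num-epoch-checkpoints 10 \
  --output checkpoint_aver.pt

An example of the generation script, which loads the averaged checkpoint through --path: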

python generate.py $DATA_DIR \
  --task bert_xymasked_wp_seq2seq --bert-model-name bert-base-german-cased \
  --path checkpoint_aver.pt --decode_use_adapter \
  --mask_pred_iter 10 --left-pad-source False \
  --batch-size 32 --beam 4 --lenpen 1.1 --remove-bpe wordpiece
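
For reference, --mask_pred_iter sets the number of mask-predict refinement iterations. The following is an illustrative Python sketch of the generic mask-predict procedure (Ghazvininejad et al., 2019) with a hypothetical model interface; it is not the decoding code of this repository:

MASK = "[MASK]"

def mask_predict(model, src_tokens, tgt_len, iterations=10):
    # Start from a fully masked target of the predicted length.
    tokens = [MASK] * tgt_len
    probs = [0.0] * tgt_len
    for t in range(iterations):
        # Re-predict every currently masked position in parallel.
        preds, pred_probs = model.predict(src_tokens, tokens)  # hypothetical interface
        for i in range(tgt_len):
            if tokens[i] == MASK:
                tokens[i], probs[i] = preds[i], pred_probs[i]
        # Linearly decay the number of tokens to re-mask across iterations.
        n_mask = int(tgt_len * (iterations - 1 - t) / iterations)
        if n_mask == 0:
            break
        # Re-mask the lowest-confidence tokens and refine them in the next iteration.
        for i in sorted(range(tgt_len), key=lambda i: probs[i])[:n_mask]:
            tokens[i] = MASK
    return tokens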
