-- This is the PyTorch implementation of our ACL 2022 paper "Dict-BERT: Enhancing Language Model Pre-training with Dictionary" [PDF]. In this paper, we propose DictBERT, a novel pre-trained language model that leverages rare word definitions from English dictionaries (e.g., Wiktionary). DictBERT is based on the BERT architecture and trained under the same settings as BERT. Please refer to our paper for more details.
python version >=3.6
transformers==4.7.0
datasets==1.8.0
torch==1.8.0
You also need to install dataclasses, scipy, scikit-learn, and nltk.
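One way to install everything with pip (a sketch; the exact CUDA-specific torch wheel may differ on your machine):
pip install transformers==4.7.0 datasets==1.8.0 torch==1.8.0 dataclasses scipy scikit-learn nltk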
-- download Wiktionary
cd preprocess_wiktionary
bash download_wiktionary.sh
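After the download finishes, the definitions can be inspected with a few lines of Python. This is only an illustrative sketch: the file name wiktionary.tsv and the two-column (word, definition) TSV layout are assumptions about the script's output, not its documented format.
# Illustrative sketch only: the file name and TSV layout below are
# assumptions about download_wiktionary.sh's output, not guaranteed.
import csv

definitions = {}
with open("wiktionary.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 2:
            definitions[row[0]] = row[1]

print(len(definitions), "entries loaded")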
-- download GLUE benchmark
cd preprocess_datasets
bash load_preprocess.sh
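The same data is also reachable directly through the Hugging Face datasets library pinned above, which is a quick way to sanity-check your setup (a minimal sketch; the repo's own preprocessing lives in load_preprocess.sh):
from datasets import load_dataset

# Load one GLUE task (SST-2) to verify the datasets library works.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])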
-- download pre-trained DictBERT from the Hugging Face Hub [link]
git lfs install
git clone https://huggingface.co/wyu1/DictBERT
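Once cloned (or directly by hub ID), the checkpoint can be loaded with transformers. A minimal sketch, assuming the checkpoint follows the standard BERT format; if the hub repo does not ship tokenizer files, the standard bert-base-uncased tokenizer is the natural fallback:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# "wyu1/DictBERT" is the hub ID from the clone URL above.
tokenizer = AutoTokenizer.from_pretrained("wyu1/DictBERT")
model = AutoModelForMaskedLM.from_pretrained("wyu1/DictBERT")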
-- without dictionary
cd finetune_wo_wiktionary
bash finetune.sh
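finetune.sh wraps the standard transformers fine-tuning recipe. Conceptually it starts from something like the following (a sketch, not the script's exact code; num_labels depends on the GLUE task):
from transformers import AutoModelForSequenceClassification

# Fresh classification head on top of the pre-trained encoder;
# num_labels=2 fits binary tasks such as SST-2.
model = AutoModelForSequenceClassification.from_pretrained(
    "wyu1/DictBERT", num_labels=2
)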
-- with dictionary
cd finetune_wi_wiktionary
bash finetune.sh
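The difference in the with-dictionary variant is that definitions of rare words are attached to the input before encoding. A conceptual sketch of that idea follows; the repo's actual rare-word detection and input formatting are defined in its preprocessing scripts, not here:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical one-entry dictionary for illustration.
definitions = {"anneal": "to heat and then cool a material to toughen it"}

text = "we anneal the metal before machining"
# Treat any word with a dictionary entry as "rare" (illustrative rule).
rare = [w for w in text.split() if w in definitions]
augmented = text + " " + " ".join(w + " : " + definitions[w] for w in rare)
inputs = tokenizer(augmented, return_tensors="pt")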
@inproceedings{yu2022dict,
title={Dict-BERT: Enhancing Language Model Pre-training with Dictionary},
author={Yu, Wenhao and Zhu, Chenguang and Fang, Yuwei and Yu, Donghan and Wang, Shuohang and Xu, Yichong and Zeng, Michael and Jiang, Meng},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={1907--1918},
year={2022}
}
Please kindly cite our paper if you find the paper and the code helpful.
Many thanks to the GitHub repository of Hugging Face Transformers. Part of our code is adapted from their code.