AdaLM: domain, language and task adaptation of pre-trained models.
Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains. Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, ACL 2021
This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks and the code to generate an incremental vocabulary for a specific domain.
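The vocabulary-generation scripts in this repository implement the paper's procedure. Purely as an illustration of the general idea (not the exact algorithm used here), a domain vocabulary can be grown by training a WordPiece tokenizer on domain text and appending the subwords missing from the original BERT vocabulary; all file names and the target size below are hypothetical placeholders.

```python
# Hedged sketch: build an incremental domain vocabulary by training a new
# WordPiece tokenizer on a domain corpus and keeping only the subwords that
# the original BERT vocabulary does not already contain. File names and the
# target vocabulary size are placeholders, not values used by this repository.
from tokenizers import BertWordPieceTokenizer

ORIG_VOCAB = "bert_vocab.txt"        # original general-domain vocab.txt (placeholder)
DOMAIN_CORPUS = "domain_corpus.txt"  # raw domain text, one document per line (placeholder)
TARGET_SIZE = 40000                  # desired size of the augmented vocabulary (placeholder)

# Train a domain WordPiece vocabulary from scratch on the domain corpus.
domain_tok = BertWordPieceTokenizer(lowercase=True)
domain_tok.train(files=[DOMAIN_CORPUS], vocab_size=TARGET_SIZE)

with open(ORIG_VOCAB, encoding="utf-8") as f:
    original = [line.rstrip("\n") for line in f]
original_set = set(original)

# Rank candidate subwords by the id the trainer assigned (roughly frequency order)
# and append the ones the original vocabulary is missing, up to the target size.
candidates = sorted(domain_tok.get_vocab().items(), key=lambda kv: kv[1])
added = [tok for tok, _ in candidates if tok not in original_set]
augmented = original + added[: max(0, TARGET_SIZE - len(original))]

with open("augmented_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(augmented) + "\n")
```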
The adapted domain-specific models can be downloaded:
- AdaLM-bio-base 12-layer, 768-hidden, 12-heads, 132M parameters || One Drive
- AdaLM-bio-small 6-layer, 384-hidden, 12-heads, 34M parameters || One Drive
- AdaLM-cs-base 12-layer, 768-hidden, 12-heads, 124M parameters || One Drive
- AdaLM-cs-small 6-layer, 384-hidden, 12-heads, 30M parameters || One Drive
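Each archive contains the model weights together with its config and vocabulary files. As a quick sanity check, a minimal loading sketch with Hugging Face Transformers, assuming the archive is unpacked into a local directory containing pytorch_model.bin, config.json and vocab.txt:

```python
# Hedged sketch: load a downloaded AdaLM checkpoint with Hugging Face Transformers
# (assumes a recent transformers version). The local directory name is a placeholder
# for wherever the One Drive archive was unpacked; it should contain
# pytorch_model.bin, config.json and vocab.txt.
from transformers import BertConfig, BertModel, BertTokenizer

model_dir = "adalm-bio-small"  # placeholder path

config = BertConfig.from_pretrained(model_dir)
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
model = BertModel.from_pretrained(model_dir, config=config)

inputs = tokenizer("aspirin inhibits cyclooxygenase activity", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 384 for small / 768 for base)
```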
Install the requirements:
pip install -r requirements.txt
Add the project root to your PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:`pwd`
The biomedical downstream tasks can be downloaded from the BLURB Leaderboard. The computer science tasks can be downloaded from allenai.
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/task/data/ # Example: "/path/to/downloaded-data-dir/chemprot/"
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=chemprot
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set paths to cache the tokenized train & dev features (the cache is tied to this tokenizer; do not reuse it with a different vocab)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.train.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.dev.bert.cache
# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
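After training finishes, $OUTPUT_PATH contains the fine-tuned weights and evaluation scores. A minimal inference sketch, assuming the script saved the model and tokenizer there in standard Hugging Face format (the example sentence and a recent transformers version are assumptions):

```python
# Hedged sketch: run the fine-tuned classifier on a new sentence, assuming
# run_classifier.py saved the model and tokenizer into $OUTPUT_PATH in the
# standard Hugging Face format and a recent transformers version is installed.
# The example sentence is illustrative only.
import os
import torch
from transformers import BertForSequenceClassification, BertTokenizer

output_path = os.environ["OUTPUT_PATH"]
tokenizer = BertTokenizer.from_pretrained(output_path, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(output_path)
model.eval()

text = "The compound strongly inhibits the kinase."  # illustrative input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted label id:", logits.argmax(dim=-1).item())
```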
To fine-tune the NER tasks, use finetune/run_ner.py as shown below. To fine-tune the PICO task, simply change run_ner to run_pico.
# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/
# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning
export TASK_NAME=jnlpba
# Set path to the model checkpoint you need to test
export CKPT_PATH=/path/to/your/model/checkpoint
# Set config file
export CONFIG_FILE=/path/to/config/file
# Set vocab file
export VOCAB_FILE=/path/to/vocab/file
# Set label file containing the tag set (e.g. BIO tags)
export LABEL_FILE=/path/to/label/file
# Set path to cache the tokenized train & dev features (the cache is tied to this tokenizer)
export CACHE_DIR=/path/to/cache
# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
--model_type bert --model_name_or_path $CKPT_PATH \
--config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
--data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
--do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
--max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
--num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
--fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
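The fine-tuned tagger in $OUTPUT_PATH can be used in the same way. A minimal sketch, assuming run_ner.py saved the model and tokenizer there in standard Hugging Face format; the example sentence is illustrative only:

```python
# Hedged sketch: tag a sentence with the fine-tuned NER model, assuming
# run_ner.py saved the model and tokenizer into $OUTPUT_PATH in the standard
# Hugging Face format. The example sentence is illustrative only.
import os
from transformers import BertForTokenClassification, BertTokenizer, pipeline

output_path = os.environ["OUTPUT_PATH"]
tokenizer = BertTokenizer.from_pretrained(output_path, do_lower_case=True)
model = BertForTokenClassification.from_pretrained(output_path)

# The pipeline maps predicted label ids back to tag names via the model config.
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer)
for entity in tagger("Interleukin-2 activates T lymphocytes."):
    print(entity["word"], entity["entity"], round(entity["score"], 3))
```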
Biomedical
| Model | JNLPBA | PICO | ChemProt | Average |
|---|---|---|---|---|
| BERT | 78.63 | 72.34 | 71.86 | 74.28 |
| BioBERT | 79.35 | 73.18 | 76.14 | 76.22 |
| PubmedBERT | 80.06 | 73.38 | 77.24 | 76.89 |
| AdaLM-bio-base | 79.46 | 75.47 | 78.41 | 77.74 |
| AdaLM-bio-small | 79.04 | 74.91 | 72.06 | 75.34 |
Computer Science
| Model | ACL-ARC | SCIERC | Average |
|---|---|---|---|
| BERT | 64.92 | 81.14 | 73.03 |
| AdaLM-cs-base | 73.61 | 81.91 | 77.76 |
| AdaLM-cs-small | 68.74 | 78.88 | 73.81 |
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project. This project follows the Microsoft Open Source Code of Conduct.
For help or issues using AdaLM, please submit a GitHub issue.
For other communications related to AdaLM, please contact Shaohan Huang ([email protected]) and Furu Wei ([email protected]).