AdaLM

Domain, language and task adaptation of pre-trained models.

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains. Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong and Furu Wei, ACL 2021

This repository includes the code to fine-tune the adapted domain-specific models on downstream tasks and the code to generate an incremental vocabulary for a specific domain.
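To make the incremental-vocabulary step concrete, here is a rough, hypothetical sketch (not the repository's actual script, and not the exact selection algorithm from the paper): it trains a WordPiece vocabulary on an in-domain corpus with the Hugging Face tokenizers library and appends the pieces that are missing from the general-domain vocabulary. The file names and vocabulary size are placeholders.

# Simplified illustration only: train a domain WordPiece vocabulary and keep
# the pieces that the general-domain vocabulary does not already contain.
# File names and vocab size below are placeholders, not the repo's defaults.
from tokenizers import BertWordPieceTokenizer

DOMAIN_CORPUS = ["domain_corpus.txt"]        # hypothetical in-domain text file
BASE_VOCAB = "bert-base-uncased-vocab.txt"   # hypothetical general-domain vocab
TARGET_SIZE = 30000                          # hypothetical domain vocab size

# Train a WordPiece vocabulary on the domain corpus.
domain_tokenizer = BertWordPieceTokenizer(lowercase=True)
domain_tokenizer.train(files=DOMAIN_CORPUS, vocab_size=TARGET_SIZE)
domain_vocab = domain_tokenizer.get_vocab()  # token -> id

# Load the general-domain vocabulary (one token per line).
with open(BASE_VOCAB, encoding="utf-8") as f:
    base_tokens = [line.rstrip("\n") for line in f]
base_set = set(base_tokens)

# Incremental vocabulary = base vocabulary plus the new domain pieces.
new_pieces = [t for t in sorted(domain_vocab, key=domain_vocab.get)
              if t not in base_set]
with open("incremental_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(base_tokens + new_pieces) + "\n")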

Pre-trained Models

The adapted domain-specific models can be downloaded below; a minimal loading sketch follows the list:

  • AdaLM-bio-base 12-layer, 768-hidden, 12-heads, 132M parameters || One Drive
  • AdaLM-bio-small 6-layer, 384-hidden, 12-heads, 34M parameters || One Drive
  • AdaLM-cs-base 12-layer, 768-hidden, 12-heads, 124M parameters || One Drive
  • AdaLM-cs-small 6-layer, 384-hidden, 12-heads, 30M parameters || One Drive
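After downloading and unpacking one of these archives, the checkpoint can be loaded like any local BERT model. A minimal sketch, assuming the directory contains the usual config, vocabulary and PyTorch weight files (the local path and example sentence are placeholders):

# Minimal loading sketch; /path/to/adalm-bio-base is a placeholder for the
# directory that holds the downloaded config, vocabulary and weight files.
from transformers import BertTokenizer, BertModel

model_dir = "/path/to/adalm-bio-base"
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
model = BertModel.from_pretrained(model_dir)

inputs = tokenizer("aspirin inhibits platelet aggregation", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, hidden_size)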

Fine-tuning Examples

Requirements

Install the requirements:

pip install -r requirements.txt

Add the project to your PYTHONPATH

export PYTHONPATH=$PYTHONPATH:`pwd`

Download Fine-tune Datasets

The biomedical downstream tasks can be downloaded from the BLURB Leaderboard. The computer science tasks can be downloaded from allenai.

Finetune Classification Task

# Set path to read training/dev dataset
export DATASET_PATH=/path/to/read/glue/task/data/            # Example: "/path/to/downloaded-glue-data-dir/mnli/"

# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning

export TASK_NAME=chemprot
# Set path to the pre-trained model checkpoint to fine-tune
export CKPT_PATH=/path/to/your/model/checkpoint

# Set config file
export CONFIG_FILE=/path/to/config/file

# Set vocab file
export VOCAB_FILE=/path/to/vocab/file

# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export TRAIN_CACHE=${DATASET_PATH}/$TASK_NAME.bert.cache
export DEV_CACHE=${DATASET_PATH}/$TASK_NAME.bert.cache

# Setting the hyperparameters for the run.
export BSZ=32
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_classifier.py \
   --model_type bert --model_name_or_path $CKPT_PATH \
   --config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
   --data_dir $DATASET_PATH --cached_train_file $TRAIN_CACHE --cached_dev_file $DEV_CACHE \
   --do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
   --max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
   --num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
   --fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir
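
When fine-tuning finishes, the model saved under $OUTPUT_PATH can be reloaded for prediction. A minimal sketch using the standard transformers sequence-classification API, assuming the script saves both the model and the tokenizer to the output directory (the path and example sentence are placeholders):

# Minimal prediction sketch for the fine-tuned classifier; the output path and
# the example sentence are placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

output_dir = "/path/to/save/result_of_finetuning"
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(output_dir)
model.eval()

inputs = tokenizer("The compound inhibits the kinase.", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())      # predicted label id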

Finetune NER Task

To fine-tune the PICO task, simply change run_ner to run_pico.

# Set path to read training/dev dataset
export DATASET_PATH=/path/to/ner/task/data/           

# Set path to save the finetuned model and result score
export OUTPUT_PATH=/path/to/save/result_of_finetuning

export TASK_NAME=jnlpba
# Set path to the pre-trained model checkpoint to fine-tune
export CKPT_PATH=/path/to/your/model/checkpoint

# Set config file
export CONFIG_FILE=/path/to/config/file

# Set vocab file
export VOCAB_FILE=/path/to/vocab/file

# Set label file, e.g. the list of BIO tags
export LABEL_FILE=/path/to/label/file

# Set path to cache train & dev features (tokenized, only use for this tokenizer!)
export CACHE_DIR=/path/to/cache

# Setting the hyperparameters for the run.
export BSZ=16
export LR=1.5e-5
export EPOCH=30
export WD=0.1
export WM=0.1
CUDA_VISIBLE_DEVICES=0 python finetune/run_ner.py \
   --model_type bert --model_name_or_path $CKPT_PATH \
   --config_name $CONFIG_FILE --tokenizer_name $VOCAB_FILE --do_lower_case \
   --data_dir $DATASET_PATH --cache_dir $CACHE_DIR --labels $LABEL_FILE \
   --do_train --do_eval --logging_steps 1000 --output_dir $OUTPUT_PATH --max_grad_norm 0 \
   --max_seq_length 128 --per_gpu_train_batch_size $BSZ --learning_rate $LR \
   --num_train_epochs $EPOCH --weight_decay $WD --warmup_ratio $WM \
   --fp16 --fp16_opt_level O2 --seed 42 --overwrite_output_dir

Results

Biomedical

Model            JNLPBA  PICO   ChemProt  Average
BERT             78.63   72.34  71.86     74.28
BioBERT          79.35   73.18  76.14     76.22
PubmedBERT       80.06   73.38  77.24     76.89
AdaLM-bio-base   79.46   75.47  78.41     77.74
AdaLM-bio-small  79.04   74.91  72.06     75.34

Computer Science

Model           ACL-ARC  SCIERC  Average
BERT            64.92    81.14   73.03
AdaLM-cs-base   73.61    81.91   77.76
AdaLM-cs-small  68.74    78.88   73.81

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project. This project follows the Microsoft Open Source Code of Conduct.

Contact Information

For help or issues using AdaLM, please submit a GitHub issue.

For other communications related to AdaLM, please contact Shaohan Huang ([email protected]), Furu Wei ([email protected]).