This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
- A BioGPT-Large model with 1.5B parameters is coming. It is currently available for the PubMedQA task, where it reaches state-of-the-art accuracy of 81%. See Question Answering on PubMedQA for evaluation.
- PyTorch version == 1.12.0
- Python version == 3.10
- fairseq version == 0.12.0:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
- Moses
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
- fastBPE
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
cd ..
- sacremoses
pip install sacremoses
- sklearn
pip install scikit-learn
Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.
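Before moving on, it can help to confirm that the setup above took effect. The following is a minimal sketch, not part of this repository: the function names `check_tool_paths` and `check_versions` and the `EXPECTED_PACKAGES` mapping are hypothetical, and the expected versions are simply copied from the requirements list.

```python
import os
import sys
from importlib import metadata

# Versions copied from the requirements list; the mapping itself is illustrative.
EXPECTED_PACKAGES = {"torch": "1.12.0", "fairseq": "0.12.0"}
EXPECTED_PYTHON = (3, 10)


def check_tool_paths(env):
    """Return a list of problems with the MOSES and FASTBPE variables in `env`."""
    problems = []
    for var in ("MOSES", "FASTBPE"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var}={path} is not a directory")
    return problems


def check_versions():
    """Return a list of mismatches between installed and expected versions."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found = ".".join(map(str, sys.version_info[:2]))
        problems.append(f"Python {found} found, expected 3.10")
    for name, wanted in EXPECTED_PACKAGES.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed, expected {wanted}")
            continue
        # Local builds may carry a suffix such as "+cu113", so compare the prefix only.
        if not found.startswith(wanted):
            problems.append(f"{name} {found} found, expected {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_tool_paths(os.environ) + check_versions():
        print("WARNING:", problem)
```

Running the script prints one warning per problem and nothing when every check passes.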
We provide our pre-trained BioGPT model checkpoint along with fine-tuned checkpoints for downstream tasks:
Model | Description | URL |
---|---|---|
BioGPT | Pre-trained BioGPT model checkpoint | link |
BioGPT-Large | Pre-trained BioGPT-Large model checkpoint | link |
BioGPT-QA-PubMedQA-BioGPT | Fine-tuned BioGPT for question answering task on PubMedQA | link |
BioGPT-QA-PubMedQA-BioGPT-Large | Fine-tuned BioGPT-Large for question answering task on PubMedQA | link |
BioGPT-RE-BC5CDR | Fine-tuned BioGPT for relation extraction task on BC5CDR | link |
BioGPT-RE-DDI | Fine-tuned BioGPT for relation extraction task on DDI | link |
BioGPT-RE-DTI | Fine-tuned BioGPT for relation extraction task on KD-DTI | link |
BioGPT-DC-HoC | Fine-tuned BioGPT for document classification task on HoC | link |
Download them and extract them to the checkpoints folder of this project.
For example:
mkdir checkpoints
cd checkpoints
wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz
tar -zxvf Pre-trained-BioGPT.tgz
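The same download-and-extract step can also be scripted in Python, for example when setting up several checkpoints at once. This is a rough sketch using only the standard library; `download_checkpoint` and `extract_checkpoint` are hypothetical helper names, not part of this repository.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_checkpoint(url, dest_dir="checkpoints"):
    """Download the archive at `url` into dest_dir (skipped if already present)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / url.rsplit("/", 1)[-1]
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_checkpoint(archive_path, dest_dir="checkpoints"):
    """Extract a .tgz archive into dest_dir and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    top_level = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
        # extractall trusts the archive contents; only use it on archives you trust.
        tar.extractall(dest)
    return sorted(top_level)


# Example usage with the URL from the commands above:
# archive = download_checkpoint(
#     "https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz")
# print(extract_checkpoint(archive))
```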
Use the pre-trained BioGPT model in your code:
import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT",
    "checkpoint.pt",
    "data",
    tokenizer='moses',
    bpe='fastbpe',
    bpe_codes="data/bpecodes",
    min_len=100,
    max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
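The snippet above prints only the first hypothesis. With beam search you may want to inspect every returned candidate. The sketch below assumes each hypothesis is a dictionary with "tokens" and "score" entries, which is our understanding of what fairseq's generator returns; `decode_hypotheses` is a hypothetical helper, not part of fairseq or this repository.

```python
def decode_hypotheses(hypotheses, decode):
    """Return (score, text) pairs for every hypothesis, best score first.

    `hypotheses` is the list of candidates for one input and `decode` is a
    callable such as `m.decode`. The "score" key is an assumption about the
    structure of each hypothesis.
    """
    results = [(float(h["score"]), decode(h["tokens"])) for h in hypotheses]
    return sorted(results, key=lambda pair: pair[0], reverse=True)


# Example usage with the model loaded above:
# for score, text in decode_hypotheses(m.generate([src_tokens], beam=5)[0], m.decode):
#     print(f"{score:.3f}\t{text}")
```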
Use the fine-tuned BioGPT model on KD-DTI for drug-target interaction extraction in your code:
import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
m = TransformerLanguageModelPrompt.from_pretrained(
    "checkpoints/RE-DTI-BioGPT",
    "checkpoint_avg.pt",
    "data/KD-DTI/relis-bin",
    tokenizer='moses',
    bpe='fastbpe',
    bpe_codes="data/bpecodes",
    max_len_b=1024,
    beam=1)
m.cuda()
src_text="" # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=1)[0]  # `args` is not defined in this snippet, so use the beam size passed to from_pretrained
output = m.decode(generate[0]["tokens"])
print(output)
For more downstream tasks, please see below.
See the corresponding folder in examples: