fanganpai/fp2bert

FP-BERT

FP-BERT is a pre-training-based method for the QSAR (quantitative structure-activity relationship) problem. We pre-train a Bidirectional Encoder Representations from Transformers (BERT) encoder in a self-supervised manner to obtain semantic representations of compound fingerprints, called Fingerprints-BERT (FP-BERT). The molecular representation encoded by FP-BERT is then fed into a convolutional neural network (CNN) to extract higher-level abstract features, and the predicted properties of the molecule are finally obtained through a fully connected layer for the respective classification or regression tasks.

In the "pre-trained" folder, the vocabulary file "mol2vec_vocabs.txt" contains 3,352 substructure identifiers plus five special tokens: [PAD], [UNK], [CLS], [SEP], and [MASK]. The "EIECTRA-train.py" script pre-trains the BERT model by learning the molecular embedding. The pre-training corpus "e15_smile_train.txt" is shared at https://figshare.com/articles/dataset/Compound_dataset_for_pre-training/19092248; to use it, change the statement "file_path = mol2vec_corpus_e15_small.txt" in "EIECTRA-train.py" to "file_path = e15_smile_train.txt". The intermediate result of the learned molecular embedding, "fingerprints_smile_output256.tar.gz", is shared at https://figshare.com/articles/software/fingerprints_smile_output256_tar_gz/19609440. The file "my_tokenizers2.py" mainly defines the tokenizer class.

In the "original-dataset" folder, we share the original datasets for the downstream classification and regression tasks.

The "fp2bert" folder contains a "preprocessing" folder and a "code" folder. The "preprocessing" folder holds the ".ipynb" notebooks for fine-tuning the BERT model on the specific downstream datasets; the ".npy" files of the intermediate BERT embeddings for the downstream tasks are shared at https://figshare.com/articles/dataset/FP2BERT_embedding/19573084. The "code" folder holds the ".ipynb" notebooks for the downstream tasks.
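To illustrate how a vocabulary of substructure identifiers plus the five special tokens can be used, here is a minimal encoding sketch. The real tokenizer lives in "my_tokenizers2.py"; the function name, padding scheme, and toy identifiers below are assumptions for illustration, not its actual API.

```python
# Sketch (assumed names): map a "sentence" of substructure identifiers to ids
# drawn from a vocabulary laid out like "mol2vec_vocabs.txt".
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def encode(identifiers, vocab, max_len=8):
    """Wrap the identifier sequence in [CLS]/[SEP], map unseen identifiers
    to [UNK], truncate to max_len, and right-pad with [PAD]."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(tok, vocab["[UNK]"]) for tok in identifiers]
    ids = ids[:max_len - 1] + [vocab["[SEP]"]]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids

# Toy vocabulary: the special tokens followed by two made-up identifiers.
vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + ["2246728737", "864662311"])}
print(encode(["2246728737", "unseen", "864662311"], vocab))
# [2, 5, 1, 6, 3, 0, 0, 0]
```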
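The downstream head (CNN over the BERT token embeddings, global pooling, then a fully connected layer) can be sketched in plain NumPy. All shapes, names, and layer sizes here are illustrative assumptions, not the configuration used in the notebooks.

```python
import numpy as np

def cnn_head(token_embeddings, conv_w, conv_b, fc_w, fc_b):
    """Toy sketch of the prediction head: 1-D convolution over the token
    embeddings, ReLU, global max pooling, then a fully connected layer.
    Hypothetical shapes: token_embeddings (seq_len, d), conv_w (k, d, n_filters),
    conv_b (n_filters,), fc_w (n_filters,), fc_b scalar."""
    seq_len, _ = token_embeddings.shape
    k = conv_w.shape[0]  # kernel width
    # Slide the kernel over the sequence; each window yields one feature vector.
    feats = np.stack([
        np.maximum(0.0, np.tensordot(token_embeddings[i:i + k], conv_w,
                                     axes=([0, 1], [0, 1])) + conv_b)
        for i in range(seq_len - k + 1)
    ])
    pooled = feats.max(axis=0)     # global max pooling over positions
    return pooled @ fc_w + fc_b    # single output: regression value or logit

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))                            # 10 tokens, 8-dim embeddings
conv_w, conv_b = rng.normal(size=(3, 8, 4)), np.zeros(4)  # kernel width 3, 4 filters
fc_w, fc_b = rng.normal(size=4), 0.0
print(cnn_head(emb, conv_w, conv_b, fc_w, fc_b))
```

A classification task would pass the output through a sigmoid or softmax; a regression task would use it directly.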
