TransformerVAE

This document describes the code used in the paper and how to reproduce its results.

Environment

The following packages are required. The latest versions may no longer be compatible with our code, while older versions within these ranges may still work.

python>=3.7
numpy>=1.21
pandas>=1.2
tqdm>=4.63
addict>=2.4
rdkit==2023.03
PyTorch==1.12

You can build a conda environment from requirements.txt, which covers every package except PyTorch, with the following commands. Install PyTorch separately, choosing the build that matches your GPU environment.

conda create -n transformervae python==3.7
conda activate transformervae
conda install pip
pip install -r requirements.txt
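
Once the environment is built, it can help to confirm that the installed versions fall within the ranges listed above. The following is a minimal sketch and not part of the repository: the helper names (parse_version, satisfies, report) are hypothetical, "==" is read as a prefix match (so 1.12.1 counts as 1.12), and "torch" is assumed as the import name of the PyTorch package.

```python
import importlib
import re

# Version ranges copied from the list above; keys are import names
# ("torch" is assumed to be the import name of the PyTorch package).
REQUIREMENTS = {
    "numpy": ">=1.21",
    "pandas": ">=1.2",
    "tqdm": ">=4.63",
    "addict": ">=2.4",
    "rdkit": "==2023.03",
    "torch": "==1.12",
}


def parse_version(text):
    """Turn '1.21.6' or '2023.03' into a tuple of ints for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def satisfies(installed, spec):
    """Check a version string against a spec such as '>=1.21' or '==1.12'."""
    op, wanted = re.fullmatch(r"(>=|==)(.+)", spec).groups()
    have, want = parse_version(installed), parse_version(wanted)
    if op == ">=":
        return have >= want
    # Assumption: '==' is a prefix match, so 1.12.1 satisfies ==1.12.
    return have[: len(want)] == want


def report(requirements=REQUIREMENTS):
    """Print one status line per package without stopping at the first problem."""
    for name, spec in requirements.items():
        try:
            version = getattr(importlib.import_module(name), "__version__", None)
        except ImportError:
            print(f"{name}: not installed (need {spec})")
            continue
        if version is None:
            print(f"{name}: installed, version unknown (need {spec})")
        else:
            status = "ok" if satisfies(version, spec) else "MISMATCH"
            print(f"{name}: {version} (need {spec}) {status}")
```

Calling report() inside the activated environment prints one line per package, so a missing or mismatched package is visible before any script is run.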

Data preprocessing for training

To train the model on a SMILES dataset, you first need to canonicalize and randomize the SMILES and then tokenize them.

python preprocess.py --processname <processed data name> --data path_to_smiles_data.txt

<processed data name> can be an arbitrary string. Processed data will be stored in ./preprocess/results/<processed data name>.
The input SMILES file ("path_to_smiles_data.txt") must contain one SMILES string per row, with no header.
Both the training data and the validation data must be preprocessed in this way before training.
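
To illustrate the tokenization step, the sketch below splits a SMILES string into tokens with a regular expression. It is an assumption about what tokenization can look like, not the scheme implemented in preprocess.py: the pattern and the function name tokenize are hypothetical, and canonicalization and randomization (which need a cheminformatics toolkit such as RDKit) are not shown.

```python
import re

# Hypothetical token pattern: bracket atoms, two-letter halogens, two-digit
# ring closures (%NN), single letters, single digits, then bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize(smiles):
    """Split one SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips characters it cannot match, so compare lengths
    # by re-joining the tokens and checking against the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize("ClCC(=O)Oc1ccccc1"))
print(tokenize("[NH4+].[Cl-]"))
```

Listing "Br" and "Cl" before the single-letter alternative keeps two-letter halogens together instead of splitting them into two tokens.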

Training

The Transformer VAE model can be trained on the preprocessed randomized/canonical SMILES.

python train.py --name example --train_data <processed train data name> --val_data <processed validation data name>
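
The preprocessing and training steps can also be scripted as one sequence. The sketch below only assembles the commands shown in this document; the function names, the processed data names train_set and val_set, and the input file names are placeholders chosen for illustration.

```python
import subprocess


def preprocess_command(process_name, smiles_path):
    """Command line for preprocess.py, using the flags shown in this document."""
    return ["python", "preprocess.py",
            "--processname", process_name, "--data", smiles_path]


def train_command(run_name, train_name, val_name):
    """Command line for train.py, using the flags shown in this document."""
    return ["python", "train.py", "--name", run_name,
            "--train_data", train_name, "--val_data", val_name]


def pipeline(train_smiles, val_smiles, run_name="example"):
    """Preprocess both splits first, then train on the processed data names."""
    return [
        preprocess_command("train_set", train_smiles),  # placeholder name
        preprocess_command("val_set", val_smiles),      # placeholder name
        train_command(run_name, "train_set", "val_set"),
    ]


def run(commands):
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Print the commands instead of running them; pass the list to run() to execute.
for command in pipeline("train_smiles.txt", "val_smiles.txt"):
    print(" ".join(command))
```

Printing before running makes it easy to check that the processed data names passed to train.py match the ones given to preprocess.py.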

Trained model weights

Trained model weights can be downloaded from Google Drive.
Currently, the following models are available:

moses: Model fully trained on the MOSES dataset.
zinc: Model fully trained on the ZINC-15 dataset.

The weights of each module in the model are stored separately in the above directories.

Featurization

You can obtain the estimated mean of the posterior distribution of the latent variables as a molecular descriptor using featurize.py.

python featurize.py --data <processed data name> --weight path_to_model_weight_dir --name <feature name>

<feature name> can be an arbitrary string. Features will be stored in ./featurization/results/<feature name>/feature.csv. Each row of this file contains the features of one molecule; the file has a header row but no index column.
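
Because the file has a header row and no index column, it can be read with the standard csv module (or, equally, with pandas.read_csv using its default settings). The following is a minimal sketch, assuming every cell holds a number; the function names parse_features and load_features and the demo values are hypothetical.

```python
import csv
import io


def parse_features(handle):
    """Read a header row plus one numeric feature vector per molecule."""
    reader = csv.reader(handle)
    header = next(reader)  # first row: column names
    rows = [[float(value) for value in row] for row in reader if row]
    if any(len(row) != len(header) for row in rows):
        raise ValueError("a row does not match the header length")
    return header, rows


def load_features(path):
    """Open a feature.csv file and parse it."""
    with open(path, newline="") as handle:
        return parse_features(handle)


# In-memory stand-in for a feature.csv with three made-up columns.
demo = io.StringIO("0,1,2\n0.5,-1.0,2.25\n1.5,0.0,-3.0\n")
header, features = parse_features(demo)
print(header, len(features))
```

With a real file, load_features takes the path to feature.csv and returns the column names together with one list of floats per molecule.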

Molecule generation

Molecules can be generated with a model you trained yourself or with the downloaded weights.

python generation/generate.py --weight path_to_model_weight_dir --n 30000 --name <generation name>

Generated molecules will be stored in generation/results/<generation name>/smiles.txt
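
A quick sanity check on the output is to count how many of the generated strings are distinct. The sketch below assumes that smiles.txt holds one SMILES string per line (the layout is not stated here) and compares raw strings, so two different spellings of the same molecule count as different unless the file is canonicalized first. The function names are hypothetical.

```python
import io


def read_smiles(handle):
    """Return the non-empty lines of a SMILES file, stripped of whitespace."""
    return [line.strip() for line in handle if line.strip()]


def uniqueness(smiles):
    """Fraction of entries that are distinct strings (0.0 for an empty list)."""
    return len(set(smiles)) / len(smiles) if smiles else 0.0


# In-memory stand-in for smiles.txt; with a real file, use open(path) instead.
demo = io.StringIO("CCO\nc1ccccc1\nCCO\n\nCC(=O)O\n")
generated = read_smiles(demo)
print(len(generated), uniqueness(generated))
```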

Decode molecule from feature

Molecules can be decoded from arbitrary latent variables using decode.py.

python decode.py --latent path_to_latent_variables --weight path_to_model_weight_dir --name <decoding name>

Latent variables to be decoded must be provided in CSV format. Each row must contain the latent variables of one molecule; a header row is required and there must be no index column. A feature file produced by featurize.py satisfies this format.
The path to the latent variables file is specified with the --latent option.
<decoding name> can be an arbitrary string. Decoded SMILES will be stored in ./decoding/results/<decoding name>/decoded_smiles.txt. This file contains the SMILES of each molecule, with no header.
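
As an example of preparing such a file by hand, the sketch below writes random latent vectors in the required layout (header row, no index column). The latent dimension (4 here), the column names, and sampling from a standard normal distribution are all assumptions made for illustration; the real dimension must match the model whose weights are used.

```python
import csv
import io
import random


def write_latents(handle, n_molecules, latent_dim, seed=0):
    """Write a header row, then one row of latent values per molecule."""
    rng = random.Random(seed)
    writer = csv.writer(handle, lineterminator="\n")
    # Hypothetical column names: 0, 1, ..., latent_dim - 1.
    writer.writerow([str(i) for i in range(latent_dim)])
    for _ in range(n_molecules):
        writer.writerow([rng.gauss(0.0, 1.0) for _ in range(latent_dim)])


# Demo on an in-memory buffer; with a real file, use
# open("latents.csv", "w", newline="") instead.
buffer = io.StringIO()
write_latents(buffer, n_molecules=3, latent_dim=4)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

The resulting file can then be passed to decode.py through the --latent option described above.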
