This document describes the code used in the paper and how to reproduce the results.
The following packages are required. The latest versions may no longer be compatible with our code; older versions within these ranges may work.
- python>=3.7
- numpy>=1.21
- pandas>=1.2
- tqdm>=4.63
- addict>=2.4
- rdkit==2023.03
- PyTorch==1.12
You can build a conda environment from requirements.txt, which covers every package except PyTorch, using the following commands. Install PyTorch separately, choosing the version that matches your GPU environment.
conda create -n transformervae python==3.7
conda activate transformervae
conda install pip
pip install -r requirements.txt
To train the model on a SMILES dataset, you first need to canonicalize and randomize the SMILES strings and then tokenize them:
python preprocess.py --processname <processed data name> --data path_to_smiles_data.txt
- <processed data name> can be an arbitrary string. Processed data will be stored in ./preprocess/results/<processed data name>.
- The input SMILES file ("path_to_smiles_data.txt") must contain one SMILES string per row, without a header.
- Both the training data and the validation data must be preprocessed before training.
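Because training needs separately preprocessed training and validation sets, a single SMILES file can be split beforehand. The following is a minimal sketch, not part of this repository: the function name `split_smiles_file`, the output file names, and the 10% validation fraction are all assumptions chosen for illustration.

```python
import random
from pathlib import Path


def split_smiles_file(src, train_path, val_path, val_fraction=0.1, seed=0):
    """Split a one-SMILES-per-row file (no header) into train/validation files.

    Hypothetical helper: names and default values are illustrative only.
    Returns (number of training rows, number of validation rows).
    """
    # Keep non-empty rows; each row is assumed to hold exactly one SMILES string.
    lines = Path(src).read_text().splitlines()
    smiles = [line.strip() for line in lines if line.strip()]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(smiles)

    # Reserve at least one row for validation.
    n_val = max(1, int(len(smiles) * val_fraction))
    val, train = smiles[:n_val], smiles[n_val:]

    # Write both files in the same format as the input (one SMILES per row).
    Path(train_path).write_text("\n".join(train) + "\n")
    Path(val_path).write_text("\n".join(val) + "\n")
    return len(train), len(val)
```

Each resulting file would then be passed to preprocess.py in its own run, with a different <processed data name> for the training and validation sets.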
The Transformer VAE model can be trained on the processed random/canonical SMILES:
python train.py --name example --train_data <processed train data name> --val_data <processed validation data name>
Trained model weights can be downloaded from Google Drive.
Currently, the following models are available:
- moses: Model fully trained on the MOSES dataset.
- zinc: Model fully trained on the ZINC-15 dataset.
The weights of each module in the model are stored separately in the above directories.
You can obtain the estimated mean of the posterior distribution of the latent variables, which serves as a molecular descriptor, using featurize.py:
python featurize.py --data <processed data name> -- path_to_model_weight_dir --name <feature name>
- <feature name> can be an arbitrary string. Features will be stored in ./featurization/results/<feature name>/feature.csv. This file contains the features of one molecule per row, with a header row and no index column.
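To use the features downstream, the CSV can be read with the Python standard library. The sketch below relies only on the layout described above (a header row, no index column, one molecule per row). The function name `load_features` is hypothetical, and treating every value as a float is an assumption.

```python
import csv


def load_features(path):
    """Read a feature CSV that has a header row and no index column.

    Hypothetical helper. Returns (column_names, rows), where each row is
    the list of feature values for one molecule.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first row holds the column names
        # Assumption: every remaining cell is numeric.
        rows = [[float(value) for value in row] for row in reader if row]
    return header, rows
```

With pandas, which is already among the required packages, `pandas.read_csv(path)` reads the same layout, since it treats the first row as the header and does not expect an index column by default.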
Molecules can be generated with a model you trained or downloaded:
python generation/generate.py --weight path_to_model_weight_dir --n 30000 --name <generation name>
- Generated molecules will be stored in generation/results/<generation name>/smiles.txt.
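A quick sanity check on the generated file is to count how many of the strings are distinct. The sketch below assumes that smiles.txt holds one SMILES string per row, which is not stated explicitly here. The function name `summarize_generated` is hypothetical. It compares raw strings only, so two different SMILES spellings of the same molecule count as distinct.

```python
def summarize_generated(path):
    """Count total and distinct rows in a generated SMILES file.

    Hypothetical helper; assumes one SMILES string per row.
    Uniqueness is string-level, with no chemical canonicalization.
    """
    with open(path) as f:
        smiles = [line.strip() for line in f if line.strip()]
    unique = set(smiles)
    return {
        "total": len(smiles),
        "unique": len(unique),
        # Avoid dividing by zero when the file is empty.
        "unique_ratio": len(unique) / len(smiles) if smiles else 0.0,
    }
```

A chemically meaningful uniqueness or validity figure would require canonicalizing each string first, for example with RDKit.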
Molecules can be decoded from arbitrary latent variables using decode.py:
python decode.py --latent path_to_latent_variables --weight path_to_model_weight_dir --name <decoding name>
- Latent variables to be decoded must be prepared in CSV format. Each row must contain the latent variables of one molecule, with no index column (a header row is required). A feature file produced by featurize.py satisfies this format.
- The path to the prepared latent variables file must be specified with the --latent option.
- <decoding name> can be an arbitrary string. Decoded SMILES will be stored in ./decoding/results/<decoding name>/decoded_smiles.txt. This file contains the SMILES of each molecule, without a header.
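One way to prepare such a file is to interpolate between the latent vectors of two molecules and write the result in the required layout (a header row, no index column). This is a minimal sketch under stated assumptions: the function names `interpolate` and `write_latent_csv` and the default column names `z0, z1, ...` are hypothetical, and it is not stated here whether decode.py expects particular header names. Reusing the header of a feature.csv produced by featurize.py is the safer choice.

```python
import csv


def interpolate(z_start, z_end, steps):
    """Return `steps` vectors evenly spaced from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_start, z_end)]
        for i in range(steps)
    ]


def write_latent_csv(path, vectors, header=None):
    """Write latent vectors as CSV with a header row and no index column."""
    if header is None:
        # Hypothetical column names; prefer the header of an existing feature.csv.
        header = [f"z{i}" for i in range(len(vectors[0]))]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)    # required header row
        writer.writerows(vectors)  # one molecule per row
```

The written file could then be passed to decode.py through the --latent option. The vectors must have the same dimension as the latent space of the model being used.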