pqsar2cpd - de novo generation of hit-like molecules from pQSAR pIC50 with AI-based generative chemistry
This repository contains the code for a conditional generative adversarial network (cGAN) that translates pQSAR profiles of pIC50 values into novel chemical structures, as described in [1].
The model itself operates entirely in the latent space. This means users can use any external molecular encoder/decoder to encode the molecules into vectors for training, and decode the output back to SMILES after inference. This way, pqsar2cpd can be integrated into any existing pipeline. We have successfully tested the approach with CDDD, JT-VAE, HierVAE, and MoLeR.
Since the model is input-agnostic, other property profiles, such as gene expression profiles or protein embeddings, could potentially be used instead of pQSAR to generate novel compounds.
pqsar2cpd is implemented in TensorFlow. To make sure all your packages are compatible, you can install the dependencies using the provided requirements file:
pip install -r requirements.txt
To train a new model, you need a set of compound vectors produced by a molecular encoder, and a matching set of property profiles. The compound and profile sets should be separate NumPy arrays of n-dimensional vectors, one row per compound, where row i of the compound array corresponds to row i of the profile array. If you're interested in using pQSAR profiles, you can follow the instructions in the pQSAR repository.
To use the model out of the box, save the compounds and profiles as separate .npy files with NumPy.
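The preparation step above can be sketched as follows. This is a minimal illustration, not part of the repository: the helper name save_training_arrays, the random placeholder data, and the vector sizes (512 and 100) are all assumptions chosen for the example. In practice the compound array would come from your molecular encoder and the profile array from pQSAR or another profiling method.

```python
import os
import tempfile

import numpy as np


def save_training_arrays(compounds, profiles, cpd_path, profile_path):
    """Check row-wise correspondence, then save both arrays as .npy files.

    Hypothetical helper for illustration; it is not provided by pqsar2cpd.
    """
    compounds = np.asarray(compounds)
    profiles = np.asarray(profiles)
    if compounds.ndim != 2 or profiles.ndim != 2:
        raise ValueError("expected 2-D arrays with one row per compound")
    if compounds.shape[0] != profiles.shape[0]:
        raise ValueError("compounds and profiles must have the same number of rows")
    np.save(cpd_path, compounds)
    np.save(profile_path, profiles)
    return compounds.shape, profiles.shape


# Placeholder data standing in for encoder output and property profiles.
# The sizes below are assumptions, not requirements of the model.
rng = np.random.default_rng(0)
latent_vectors = rng.normal(size=(8, 512))
property_profiles = rng.normal(size=(8, 100))

out_dir = tempfile.mkdtemp()
cpd_file = os.path.join(out_dir, "cpd.npy")
profile_file = os.path.join(out_dir, "profiles.npy")
shapes = save_training_arrays(latent_vectors, property_profiles, cpd_file, profile_file)
```

Checking the row counts before saving catches misaligned inputs early, since the training script has no other way to know which profile belongs to which compound.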
To train the model, run:
python train.py --compounds='cpd.npy' --profiles='profiles.npy'
You can also specify the number of epochs with an optional argument, e.g. --epochs=400.
The script will train the cGAN and save the generator as pqsar2cpd.h5, which will be ready for use in inference.
To generate novel molecules out of a set of profiles, run:
python predict.py --model='pqsar2cpd.h5' --profiles='test.npy' --output='new_mols.h5' --n_samples=100
This will load the profile NumPy array from test.npy and generate 100 samples for each of the profiles in the set. The results will then be saved in new_mols.h5 in HDF5 format, with the samples for each profile stored as a dataset keyed by the profile index. These can now be passed to the molecular decoder to get the SMILES.
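Reading the generated samples back before decoding might look like the sketch below. It uses h5py and assumes a flat file layout in which each top-level key is the profile index written as a string and each dataset is a 2-D array of latent vectors. The helper name load_samples, the demo file, and the vector size of 512 are assumptions for illustration, so inspect the keys of your own output file before relying on this layout.

```python
import os
import tempfile

import h5py
import numpy as np


def load_samples(path):
    """Return a dict mapping each top-level key to its dataset as a NumPy array.

    Hypothetical helper; assumes every top-level entry is a dataset, not a group.
    """
    with h5py.File(path, "r") as f:
        return {key: f[key][()] for key in f.keys()}


# Build a small stand-in file so the example runs without a trained model.
# Key format ("0", "1", ...) and array shape are assumptions.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_mols.h5")
with h5py.File(demo_path, "w") as f:
    for profile_index in range(3):
        f.create_dataset(str(profile_index), data=np.zeros((100, 512), dtype=np.float32))

samples = load_samples(demo_path)
for key, vectors in samples.items():
    # Each array holds the generated latent vectors for one input profile.
    print(key, vectors.shape)
```

Each returned array can then be handed to the decoder of whichever molecular encoder/decoder produced the training vectors.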
Code authored by Michal Pikusa
Contributions: Florian Nigsch, W. Armand Guiguemde, Eric Martin, William J. Godinez, Christian Kolter
[1] De-novo generation of novel phenotypically active molecules for Chagas disease from biological signatures using AI-driven generative chemistry
Michal Pikusa, Olivier René, Sarah Williams, Yen-Liang Chen, Eric Martin, William J. Godinez, Srinivasa P S Rao, W. Armand Guiguemde, Florian Nigsch
bioRxiv 2021.12.10.472084; doi: https://doi.org/10.1101/2021.12.10.472084