Generating customized compound libraries for drug discovery with machine intelligence

Abstract of the paper: Generative machine learning models sample drug-like molecules from chemical space without the need for explicit design rules. A deep learning framework for customized compound library generation is presented, aiming to enrich and expand the pharmacologically relevant chemical space with new molecular entities ‘on demand’. This de novo design approach was used to generate molecules that combine features from bioactive synthetic compounds and natural products, which are a primary source of inspiration for drug discovery. The results show that the data-driven machine intelligence acquires implicit chemical knowledge and generates novel molecules with bespoke properties and structural diversity. The method is available as an open-access tool for medicinal and bioorganic chemistry.

Keywords: artificial intelligence; drug design; generative model; natural product; neural network; language model

Requirements

First, you need to clone the repo:

git clone [email protected]:moretm/virtual_libraries.git

Then, you can run the following command which will create a conda virtual environement and install all the needed packages:

If you are working with a Mac:

cd virtual_libraries/
sh install_mac.sh

If you are working with a linux system:

cd virtual_libraries/
sh install_linux.sh

Once the installation is done, you can activate the virtual conda environement for this project:

conda activate mollib

Please note that you will need to activate this virtual conda environement every time you want to use this project.

How to run an example experiment

Now, you can run an example experiment, e.g., the one with five dissimilar molecules from MEGx in the paper:

cd experiments/
sh run_morty.sh ../data/paper_five_dissimilar_MEGx.txt

The results of the analysis can be found in experiments/results/paper_five_dissimilar_MEGx/

How to run an experiment on your own data

To apply transfer learning on your own set of molecules, you will need a .txt file with one SMILES string per line. Then, you can just run the following command:

sh run_morty.sh {path/to/your/file.txt}

You will find the results of your experiment in experiments/results/{name_of_your_file}/

What can be found in the results folder?

1. A figure displaying the results, as in the paper:

(in experiments/results/{name_of_your_file}/resume/resume_0.7.jpg)
a. Fréchet ChemNet Distance (FCD) plot showing how the generated molecules evolved during transfer learning with respect to the source space (ChEMBL24) and the target space (in the paper: MEGx natural products).
b. Boxplot displaying the evolution of the CSP3 fraction during transfer learning.
c. Uniform Manifold Approximation and Projection (UMAP), which displays molecules from the training set (ChEMBL24) used to pretrain the chemical language model (CLM), molecules from the target set (MEGx), molecules generated from the pretrained CLM, molecules generated at the last epoch of transfer learning and finally molecules of the transfert learning set.
d. The five most common scaffolds for which molecules were generated, along with their scaled Shannon entropy (SSE) and the time during which they were present (in percent with respect to all sampled molecules at the given epoch).

2. An interactive UMAP plot that can be opened in the browser:

When the example run has finished, the interactive map of chemical space can be found there: experiments/results/paper_five_dissimilar_MEGx/umap/interative_umap.html or in experiments/results/{name_of_your_file}/umap/interative_umap.html if you ran an experiment on your own data). You can open this html file with a web browser. This interactive map allows you to zoom in our out and to see some properties of the molecules when you'll pass over them with your pointer. Finally, if you click on a molecule, it will open it on https://molview.org where you can see its structure.

3. The generated novo molecules by the chemical language model:

You can find the list of molecules (.txt file with one SMILES string per line) in experiments/results/{name_of_your_file}/novo_molecules/. Each of those molecules are grammatically valid and not found in the training set. Molecules were sampled every 10 epochs (10, 20, 30 and 40) during transfer learning.

How is the pipeline working?

In experiments/ you can find all the files used to process the data, train the chemical language model and analyze the results. All do_{something}.py files are where the actual code resides. All those files are run from a bash script called run_morty.sh, which executes other bash scripts called run_do_{something}.sh. Each of those run_do_{something}.sh files takes care of the compute pipeline, namely (i) preprocessing, (ii) training, (iii) data generation, and (iv) data analysis. You can use any of these scripts seperately if needed.
The parameters of the neural network, how to do the processing, etc ., are defined in the file parameters.init. If you want, you can play with it. However, keep in mind that you will need access to a GPU if you wish to (re)train a language model, which is already provided here (otherwise, it will take a very long time to run the full pipeline). Note that there are some additional parameters and helper functions defined in src/python/.

FAQ

Do I need a GPU to run this code?
No, you don't. We already trained the chemical language model, which is the part for which you really need a GPU. Even though it might take some time to sample new molecules, you can do it on your own computer – which was one of our goals.

I have never used a terminal, nor do I know what Git really is. What should I do?
If your goal is just to use our project to generate molecules, then we suggest that you use the release we made on Code Ocean (TODO when manuscript accepted). There, you have the possibility to run the code without having to install anything.

Can I use this code if I work with Windows?
We did not test our code with Windows. If you try, please let us know what was the modifications you had to make to the requirements such that everybody can benefit from your experience!

How to cite this work

TODO: update when published.

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
experiments		experiments
models		models
src/python		src/python
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
environment_linux.yml		environment_linux.yml
environment_mac.yml		environment_mac.yml
install_linux.sh		install_linux.sh
install_mac.sh		install_mac.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Generating customized compound libraries for drug discovery with machine intelligence

Table of Contents

Description

Requirements

How to run an example experiment

How to run an experiment on your own data

What can be found in the results folder?

1. A figure displaying the results, as in the paper:

2. An interactive UMAP plot that can be opened in the browser:

3. The generated novo molecules by the chemical language model:

How is the pipeline working?

FAQ

How to cite this work

License

About

Releases

Packages

Languages

License

grisoniFr/virtual_libraries

Folders and files

Latest commit

History

Repository files navigation

Generating customized compound libraries for drug discovery with machine intelligence

Table of Contents

Description

Requirements

How to run an example experiment

How to run an experiment on your own data

What can be found in the results folder?

1. A figure displaying the results, as in the paper:

2. An interactive UMAP plot that can be opened in the browser:

3. The generated novo molecules by the chemical language model:

How is the pipeline working?

FAQ

How to cite this work

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages