Skip to content

A geometry-complete diffusion generative model (GCDM) for structure-based drug design (Nature CommsChem)

License

Notifications You must be signed in to change notification settings

BioinfoMachineLearning/GCDM-SBDD

Repository files navigation

GCDM-SBDD

PyTorch Lightning Config: Hydra Paper Checkpoints DOI

Description

This is the official structure-based drug design (SBDD) codebase of the paper

Geometry-Complete Diffusion for 3D Molecule Generation and Optimization, Nature CommsChem

[arXiv] [Nature CommsChem]

Contents

System requirements

OS requirements

This package supports Linux. The package has been tested on the following Linux system: Description: AlmaLinux release 8.9 (Midnight Oncilla)

Python dependencies

This package is developed and tested under Python 3.10.x. The primary Python packages and their versions are as follows. For more details, please refer to the environment.yaml file.

hydra-core=1.3.2
matplotlib-base=3.7.1
numpy=1.24.3
pyg=2.3.0=py310_torch_2.0.0_cu118
python=3.10.11
pytorch=2.0.1=py3.10_cuda11.8_cudnn8.7.0_0
pytorch-scatter=2.1.1=py310_torch_2.0.0_cu118
pytorch-lightning=2.0.2
scikit-learn=1.2.2
torchmetrics=0.11.4

Installation guide

Install mamba (~500 MB: ~1 minute)

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result

Install dependencies (~15 GB: ~10 minutes)

# clone project
git clone https://github.com/BioinfoMachineLearning/GCDM-SBDD
cd GCDM-SBDD

# create conda environment
mamba env create -f environment.yaml
conda activate GCDM-SBDD  # note: one still needs to use `conda` to (de)activate environments

# install local project as package
pip3 install -e .

Download checkpoints (~500 MB extracted: ~2 minutes)

Note: Make sure to be located in the project's root directory beforehand (e.g., ~/GCDM-SBDD/)

# fetch and extract model checkpoints directory
wget https://zenodo.org/record/10995319/files/GCDM_SBDD_Checkpoints.tar.gz
tar -xzf GCDM_SBDD_Checkpoints.tar.gz
rm GCDM_SBDD_Checkpoints.tar.gz

NOTE: Trained EGNN baseline checkpoint files are also included in GCDM_SBDD_Checkpoints.tar.gz.

QuickVina 2

For docking, download QuickVina 2 and copy it to your Conda environment's binary (bin) directory:

wget https://github.com/QVina/qvina/raw/master/bin/qvina2.1
chmod +x qvina2.1
mv qvina2.1 $HOME/mambaforge/envs/GCDM-SBDD/bin

We need MGLTools for preparing the receptor for docking (pdb -> pdbqt) but it can mess up your Mamba environment, so I recommend making a new one:

mamba create -n mgltools -c bioconda mgltools

Binding MOAD

Data preparation

Download the dataset

wget http:https://www.bindingmoad.org/files/biou/every_part_a.zip
wget http:https://www.bindingmoad.org/files/biou/every_part_b.zip
wget http:https://www.bindingmoad.org/files/csv/every.csv

unzip every_part_a.zip
unzip every_part_b.zip

Process the raw data using

python process_bindingmoad.py <bindingmoad_dir>

or, to suppress warnings,

python -W ignore process_bindingmoad.py <bindingmoad_dir>

CrossDocked Benchmark

Data preparation

Download and extract the dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data

Process the raw data using

python process_crossdock.py <crossdocked_dir> --no_H

Tutorials

We provide a two-part tutorial series of Jupyter notebooks to provide users with a real-world example of how to use GCDM-SBDD for pocket-based molecule generation and filtering, as outlined below.

  1. Generating molecules in target protein pockets
  2. Filtering generated molecules using PoseBusters

Demo

Sample molecules for a given pocket

To sample small molecules for a given pocket with a trained model use the following command:

python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outdir <output_dir> --resi_list <list_of_pocket_residue_ids>

For example:

python generate_ligands.py last.ckpt --pdbfile 1abc.pdb --outdir results/ --resi_list A:1 A:2 A:3 A:4 A:5 A:6 A:7 

Alternatively, the binding pocket can also be specified based on a reference ligand in the same PDB file:

python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outdir <output_dir> --ref_ligand <chain>:<resi>

Optional flags:

Flag Description
--n_samples Number of sampled molecules
--all_frags Keep all disconnected fragments
--sanitize Sanitize molecules (invalid molecules will be removed if this flag is present)
--relax Relax generated structure in force field
--resamplings Inpainting parameter (doesn't apply if conditional model is used)
--jump_length Inpainting parameter (doesn't apply if conditional model is used)

Instructions for use

Training

Starting a new training run:

python -u train.py config=<config>.yml

Resuming a previous run:

python -u train.py config=<config>.yml resume=<checkpoint>.ckpt

Reproduce paper results

test.py can be used to sample molecules for the entire testing set:

python test.py <checkpoint>.ckpt --test_dir <bindingmoad_dir>/processed_noH/test/ --outdir <output_dir> --fix_n_nodes

Using the optional --fix_n_nodes flag lets the model produce ligands with the same number of nodes as the original molecule. Other optional flags are identical to generate_ligands.py.

Compute sample metrics

For assessing basic molecular properties create an instance of the MoleculeProperties class and run its evaluate method:

from analysis.metrics import MoleculeProperties
mol_metrics = MoleculeProperties()
all_qed, all_sa, all_logp, all_lipinski, per_pocket_diversity = \
    mol_metrics.evaluate(pocket_mols)

evaluate() expects a list of lists where the inner list contains all RDKit molecules generated for one pocket.

For computing docking scores, run QuickVina as described below.

Run QuickVina2

First, convert all protein PDB files to PDBQT files using MGLTools

conda activate mgltools
cd analysis
python2 docking_py27.py <bindingmoad_dir>/processed_noH/test/ <output_dir> bindingmoad
cd ..
conda deactivate

Then, compute QuickVina scores:

conda activate GCDM-SBDD
python3 analysis/docking.py --pdbqt_dir <docking_py27_outdir> --sdf_dir <test_outdir> --out_dir <qvina_outdir> --write_csv --write_dict --dataset moad

NOTE: One can reference analysis/inference_analysis.py and analysis/molecule_analysis.py to analyze the generated molecules.

Docker

To run this project in a Docker container, you can use the following commands:

## Build the image
docker build -t gcdm-sbdd .

## Run the container (with GPUs and mounting the current directory)
docker run -it --gpus all -v .:/mnt --name gcdm-sbdd gcdm-sbdd

Note: You will still need to download the checkpoints and data as described in the installation guide. Then, update the Python commands to point to the desired local location of your files (e.g., /mnt/checkpoints and /mnt/outputs) once in the container.

Acknowledgements

GCDM-SBDD builds upon the source code and data from the following projects:

We thank all their contributors and maintainers!

License

This project is covered under the MIT License.

Citation

If you use the code or data associated with this package or otherwise find this work useful, please cite:

@article{morehead2024geometry,
  title={Geometry-complete diffusion for 3D molecule generation and optimization},
  author={Morehead, Alex and Cheng, Jianlin},
  journal={Communications Chemistry},
  volume={7},
  number={1},
  pages={150},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

About

A geometry-complete diffusion generative model (GCDM) for structure-based drug design (Nature CommsChem)

Resources

License

Stars

Watchers

Forks

Packages

No packages published