Official implementation of ICML 2024 "MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space".
Check our demo: coming soon
You will need a host machine with gpu, and a docker with nvidia-container-runtime
enabled.
Tip
- This repo provides an easy-to-use script to install docker and nvidia-container-runtime, in
./docker
runsudo ./setup_docker_for_host.sh
to set up your host machine. - For details, please refer to the install guide.
We highly recommend you to set up the environment via docker, since all you need to do is a simple make
command.
cd ./docker
make
Note
- The
make
will automatically build the docker image and run the container, inside which your host home directory will be mounted to${HOME}/home
. The docker image supports running any command inside a tmux session. - If
make
exits with error messages with pip, trymake
again in case of any network connection timeout. - Once
make
succeeds, note that exiting from container wound't stop it, nor the tmux sessions running within. To enter an existing container, simply runmake
again. If you need to stop this container, runmake kill
. - For customized environment setup, please refer to files
docker/Dockerfile
,docker/asset/requirements.txt
anddocker/asset/apt_packages.txt
. - If your bash does not highlight properly, try
sh -c "$(wget https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"
to reinstall zsh template. - PoseCheck should be manually installed. See instructions below.
You can also build your own environment through conda env create -f environment.yml
. Here we list the main packages we used, in case you want a different version:
conda create -n molcraft
conda activate molcraft
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
conda install lightning -c conda-forge
conda install pyg -c pyg
conda install rdkit openbabel pyyaml easydict python-lmdb -c conda-forge
For evaluation, you will need to install vina
(affinity), posecheck
(clash, strain energy, and key interactions), and spyrmsd
(rmsd).
# for vina docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
# for posecheck evaluation
git clone https://github.com/cch1999/posecheck.git
cd posecheck
git checkout 57a1938
pip install -e .
pip install -r requirements.txt
conda install -c mx reduce
# for spyrmsd
conda install spyrmsd -c conda-forge
Note
- If you encounter vina fail, please check
/opt/conda/lib/python3.9/site-packages/vina/vina.py
, line 260, change toastype(np.int64)
- The latest version of PoseCheck uses a different algorithm for calculating strain energy, and installing commit
57a1938
will reproduce our results.
Posecheck may fail to load protein when multiple processes access the same pdb file. A hot fix is by preprocessing all pdb files in advance:
# in posecheck/posecheck/utils/loading.py:
# change line 62-68 to
...
if not os.path.exists(tmp_path):
# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)
# Load the protein from the temporary PDB file
prot = load_protein_prolif(tmp_path)
# os.remove(tmp_path)
...
# adding the following 2 functions
# preprocess pdb files
def prepare_protein_from_pdb(pdb_path_list: list[str], reduce_path: str = REDUCE_PATH):
for pdb_path in pdb_path_list:
tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"
# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)
return
# after pose check, you can remove all tmp files by hand:
def remove_protein_tmp_file(pdb_path_list: list[str]):
for pdb_path in pdb_path_list:
tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"
if os.path.exists(tmp_path):
os.remove(tmp_path)
return
/checkpoints
: The official checkpointlast.ckpt
(43M) will be automatically cloned here./configs
: We use yaml file to manage configs for model, directory, data, train, and evaluation. Some of the parameters are provided as input arguments (e.g.,--test_only --no_wandb
), and will be automatically updated and converted to aconfig
object./core
: The main code directory./callbacks
: pytorch-lightning callbacks for validation, docking, etc./evaluation
: Basic functions for evaluating conformation, affinity, etc./models
:bfn_base.py
: BFN base class, implemented with Bayesian update and various loss.bfn4sbdd.py
: The score model for SBDD, implemented with SBDD output, loss, sampling.sbdd_train_loop.py
: ALightningModule
, implemented withtraining_step
,validation_step
,test_step
andsampling_step
.uni_transformer.py
: backbone network, same as TargetDiff.
/utils
: util functions for reconstructing molecule from point cloud, featurizing protein-ligand data, etc.
/logs
: The default output folder, containing checkpoints, generated molecules, evaluation results, etc./test
: Evaluation code.sample_for_pocket.py
, scripts.mk, train_bfn.py: entry scripts.
Data used for training / evaluating the model should be put in the data
folder by default, and accessible in the data Google Drive folder.
To train the model from scratch, download the lmdb file and split file into data folder:
crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
crossdocked_pocket10_pose_split.pt
To evaluate the model on the test set, download and unzip the test_set.zip
into data folder. It includes the original PDB files that will be used in Vina Docking.
By default, We transform the lmdb further into the featurized dataset as crossdocked_v1.1_rmsd1.0_pocket10_add_aromatic_transformed_simple.pt
as described in transform.py
, which might take several minutes. To enable accelerated training, the yaml file will be set as follows:
data:
name: pl_tr # [pl, pl_tr] where tr means offline-transformed
Run make -f scripts.mk
(without the need for data preparation), or alternatively (with data folder correctly configured),
python train_bfn.py --exp_name {EXP} --revision {REVISION}
where the default values should be set the same as:
python train_bfn.py --sigma1_coord 0.03 --beta1 1.5 --lr 5e-4 --time_emb_dim 1 --epochs 15 --max_grad_norm Q --destination_prediction True --use_discrete_t True --num_samples 10 --sampling_strategy end_back_pmf
Note
The default validation during training uses:
epochs = 15
: It takes around 1 day and 20G GPU memory on a single 3090. 15 epochs is ~187k steps, which is close to TargetDiff.sample_num_atoms = ref
:ref
mode will use the reference molecule to determine atom numbers before generation. This mode produces more stable results thanprior
mode (same as TargetDiffprior
mode). In general, the average numbers of atoms of molecules generated by the two modes are similar (~23).sample_steps = 100
: Empirically, 100 steps are good enough for our model (previous diffusion models usually sample 1000 steps). The reported results of our model are also sampled in 100 steps. Besides, 100 steps save a lot of validation time.num_samples = 5
: In order to save time, we only sample 1 molecules for validation and 5 molecules for testing. For formal evaluation, change tonum_samples = 100
. You can skip energy and clash computation incore/evaluation/metrics.py
(line 184-203) to speed up validation.log_dir = logs/xxx/exp_name/revision
: The wandb logs, checkpoints, and generated samples are stored in this folder by default.- Post-processing: We use openbabel to add bonds for generated point clouds (
core/utils/reconstruct.py
, legacy code from TargetDiff). And we filter out incomplete and invalid (rdkit.Chem.SanitizeMol fail) molecules and only evaluate the remaining ones (> 97% of the samples).
For quick evaluation of the official checkpoint, refer to make evaluate
in scripts.mk
:
python train_bfn.py --test_only --no_wandb --ckpt_path ./checkpoints/last.ckpt
By default, your checkpoints are stored as logs/xxx/exp_name/revision/checkpoints/epochxxx.ckpt
. Use --ckpt_path logs/xxx/exp_name/revision/checkpoints/epochxxx.ckpt
will automatically change log_dir
to logs/xxx/exp_name/revision
. And the test results are saved in logs/xxx/exp_name/revision/test_outputs_v?
, where _v?
is to distinguish different runs.
Note
- Molecule size (number of non-H atoms) has a large impact on many metrics. You should control the average molecule sizes before comparing different models.
For quick debugging training process, run make debug -f scripts.mk
:
python train_bfn.py --no_wandb --debug --epochs 1
We provide the pretrained checkpoint as last.ckpt.
Run make evaluate -f scripts.mk
, or alternatively,
python train_bfn.py --config_file configs/default.yaml --exp_name {EXP} --revision {REVISION} --test_only --num_samples {NUM_MOLS_PER_POCKET} --sample_steps 100
The output molecules will be saved in ${HOME}/home/logs/{exp_name}/{revision}/test _outputs/mols
. If the sampling is run multiple times, there will be several mols-v{version_number}
folders.
To sample from a whole protein pdb file, we need the corresponding reference ligand to clip the protein pocket (a 10A region around the reference ligand).
Here is an example. The first argument is the protein pdb path, the second argument is the ligand sdf path, and you can choose your own checkpoint by changing the parameters in call()
of sample_for_pocket_v2.py
.
python sample_for_pocket_v2.py data/test_set/ABL2_HUMAN_274_551_0/4xli_B_rec.pdb data/test_set/ABL2_HUMAN_274_551_0/4xli_B_rec_4xli_1n1_lig_tt_min_0.sdf
For binding affinity (Vina Score / Min / Dock) and molecular properties (QED, SA), it is calculated upon sampling.
For PoseCheck (strain energy, clashes) and other conformational results (bond length, bond angle, torsion angle, RMSD), please refer to test
folder.
We provide samples for all SBDD baselines in the sample Google Drive folder.
You may download the all_samples.tar.gz
and then tar xzvf all_samples.tar.gz
, which extracts all the pt files into samples
folder for evaluation.
With gradio
and gradio_molecule3d
installed, you can simply run python app.py
to open the demo locally. Port mapping has been set in Makefile if you are using docker. You should also forward this port if you run the docker in an ssh server. We will share a permanent demo link later.
Great thanks to @duerrsimon for his kind support in resolving rendering issues!
@article{qu2024molcraft,
title={MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space},
author={Qu, Yanru and Qiu, Keyue and Song, Yuxuan and Gong, Jingjing and Han, Jiawei and Zheng, Mingyue and Zhou, Hao and Ma, Wei-Ying},
journal={ICML 2024},
year={2024}
}
@article{song2024unified,
title={Unified Generative Modeling of 3D Molecules via Bayesian Flow Networks},
author={Song, Yuxuan and Gong, Jingjing and Qu, Yanru and Zhou, Hao and Zheng, Mingyue and Liu, Jingjing and Ma, Wei-Ying},
journal={ICLR 2024},
year={2024}
}