init
MinkaiXu committed Mar 25, 2022
1 parent 2595457 commit d76991d
Showing 29 changed files with 4,317 additions and 4 deletions.
134 changes: 134 additions & 0 deletions .gitignore
@@ -0,0 +1,134 @@
# script
*.sh
# *.ipynb

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

154 changes: 150 additions & 4 deletions README.md
@@ -1,8 +1,154 @@
# GeoDiff
# GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

[[OpenReview]](https://openreview.net/forum?id=PzcvxEMzvQC)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/MinkaiXu/GeoDiff/blob/main/LICENSE)

[[OpenReview](https://openreview.net/forum?id=PzcvxEMzvQC)] [[arXiv](https://arxiv.org/abs/2203.02923)] [[Code](https://github.com/MinkaiXu/GeoDiff)]

The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 **Oral Presentation**)
The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 **Oral Presentation [54/3391]**).

The code is coming soon. We have a primary version on [OpenReview](https://openreview.net/forum?id=PzcvxEMzvQC) as the supplymentary material, and the link is also copied [here](https://openreview.net/attachment?id=PzcvxEMzvQC&name=supplementary_material).
![cover](assets/geodiff_framework.png)

## Environments

### Install via Conda (Recommended)

```bash
# Clone the environment
conda env create -f env.yml
# Activate the environment
conda activate geodiff
# Install PyG
conda install pytorch-geometric=1.7.2=py37_torch_1.8.0_cu102 -c rusty1s -c conda-forge
```

## Dataset

### Official Dataset
The official raw GEOM dataset is available [[here]](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF).

### Preprocessed dataset
We provide the preprocessed datasets (GEOM) in this [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). After downloading a dataset, place it at the path given by the `dataset` entries of the config files `./configs/*.yml`.
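
As a quick sanity check before training, the sketch below reports which `dataset` paths from a parsed config are missing on disk. It assumes the config has already been loaded into a plain dictionary; the helper name `missing_dataset_files` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_dataset_files(config: dict, root: str = ".") -> list:
    """Return 'split: path' strings for dataset files that do not exist.

    Hypothetical helper: `config` is a config file parsed into a dict.
    """
    missing = []
    for split, rel_path in config.get("dataset", {}).items():
        # Config paths are relative (e.g. ./data/GEOM/Drugs/...), so resolve
        # them against the directory the scripts are run from.
        if not (Path(root) / rel_path).is_file():
            missing.append(f"{split}: {rel_path}")
    return missing


# Paths copied from configs/drugs_default.yml.
example_config = {
    "dataset": {
        "train": "./data/GEOM/Drugs/train_data_40k.pkl",
        "val": "./data/GEOM/Drugs/val_data_5k.pkl",
        "test": "./data/GEOM/Drugs/test_data_1k.pkl",
    }
}

for entry in missing_dataset_files(example_config):
    print("missing", entry)
```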

### Prepare your own GEOM dataset from scratch (optional)

You can also download the original full GEOM dataset and prepare your own data split. A guide is available on the [[github page]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional) of the previous work ConfGF.

## Training

All hyper-parameters and training details are provided in the config files (`./configs/*.yml`); feel free to tune these parameters.

You can train the model with the following commands:

```bash
# Default settings
python train.py ./configs/qm9_default.yml
python train.py ./configs/drugs_default.yml
# An ablation setting with fewer timesteps, as described in Appendix D.2.
python train.py ./configs/drugs_1k_default.yml
```

The model checkpoints, the configuration yaml file, and the training log will be saved into the directory specified by `--logdir` in `train.py`.

## Generation

We provide the checkpoints of two trained models, `qm9_default` and `drugs_default`, in the [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). Put the checkpoints `*.pt` into paths like `${log}/${model}/checkpoints/`, and put the corresponding configuration file `*.yml` into the parent directory `${log}/${model}/`.
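
The layout described above can be spelled out with `pathlib`. This is only an illustrative sketch: the function `expected_layout`, the log directory name `logs`, the iteration number, and the config file name are hypothetical placeholders for `${log}`, `${iter}` and the `*.yml` file.

```python
from pathlib import Path


def expected_layout(log_dir: str, model: str, iteration: int) -> dict:
    """Build the checkpoint and config locations for ${log}, ${model}, ${iter}."""
    model_dir = Path(log_dir) / model
    return {
        # The checkpoint lives one level below the model directory.
        "checkpoint": model_dir / "checkpoints" / f"{iteration}.pt",
        # Hypothetical file name: the text only requires a *.yml file here.
        "config": model_dir / f"{model}.yml",
    }


layout = expected_layout("logs", "qm9_default", 100)
for name, path in layout.items():
    print(name, path, "(found)" if path.is_file() else "(not found)")
```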

You can generate conformations for all or part of a test set with:

```bash
python test.py ${log}/${model}/checkpoints/${iter}.pt \
--start_idx 800 --end_idx 1000
```
Here `start_idx` and `end_idx` indicate the range of the test set to use. All hyper-parameters related to sampling can be set in `test.py`. For the qm9 model, you can add the argument `--w_global 0.3`, which empirically gives slightly better results.
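
To make the index range concrete, here is a minimal sketch of selecting a sub-range of a test set. It assumes a half-open interval (`start_idx` included, `end_idx` excluded), which is the usual Python slicing convention; how `test.py` treats the end index is not stated here, and `select_range` is a hypothetical helper.

```python
def select_range(test_set, start_idx=0, end_idx=None):
    """Return the items in [start_idx, end_idx) of a list-like test set.

    Assumption: half-open range, as in ordinary Python slicing.
    """
    if end_idx is None:
        end_idx = len(test_set)
    # Slicing clamps silently when end_idx exceeds the set size.
    return test_set[start_idx:end_idx]


# Stand-in for a test set of 1000 molecules.
molecules = list(range(1000))
subset = select_range(molecules, start_idx=800, end_idx=1000)
print(len(subset))
```

Under this assumption, `--start_idx 800 --end_idx 1000` would cover 200 molecules.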

Conformations of some drug-like molecules generated by GeoDiff are provided below.

<p align="center">
<img src="assets/exp_drugs.png" />
</p>

## Evaluation

After generating conformations with the above commands, the results of all benchmark tasks can be calculated from the generated data.

### Task 1. Conformation Generation

The `COV` and `MAT` scores on the GEOM datasets can be calculated using the following commands:

```bash
python eval_covmat.py ${log}/${model}/${sample}/sample_all.pkl
```
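
For intuition, the sketch below computes recall-style coverage (`COV`) and matching (`MAT`) scores from a precomputed RMSD matrix. It assumes the common definitions: `COV` is the fraction of reference conformations that have at least one generated conformation within a threshold, and `MAT` is the mean of the smallest RMSD per reference conformation. The function `cov_mat`, the example matrix, and the threshold value are illustrative, and `eval_covmat.py` may differ in detail.

```python
def cov_mat(rmsd, threshold):
    """Coverage and matching scores from an RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformation i and
    generated conformation j (values in Angstrom).
    """
    # Closest generated conformation for each reference conformation.
    best = [min(row) for row in rmsd]
    cov = sum(1 for d in best if d <= threshold) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


# Toy matrix: 3 reference conformations x 2 generated conformations.
example_rmsd = [
    [0.2, 0.9],
    [0.8, 0.6],
    [1.5, 0.4],
]
cov, mat = cov_mat(example_rmsd, threshold=0.5)  # illustrative threshold
print(f"COV = {cov:.3f}, MAT = {mat:.3f}")
```

Higher `COV` and lower `MAT` indicate that the generated set covers the reference conformations more closely.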


### Task 2. Property Prediction

For property prediction, we use a small split of qm9 that differs from the one used in the `Conformation Generation` task. This split is also provided in the [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). You can generate conformations and evaluate the `mean absolute error (MAE)` metric on this split with the following commands:

```bash
python test.py ${log}/${model}/checkpoints/${iter}.pt --num_confs 50 \
--start_idx 0 --test_set data/GEOM/QM9/qm9_property.pkl
python eval_prop.py --generated ${log}/${model}/${sample}/sample_all.pkl
```
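
The metric itself is simple. As a minimal sketch, the mean absolute error between property values predicted from generated conformations and reference values can be computed as follows; the function name and the numbers are made up for illustration and are not output of `eval_prop.py`.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up property values for three molecules.
predicted_values = [1.0, 2.0, 4.0]
reference_values = [1.5, 2.0, 3.0]
print(mean_absolute_error(predicted_values, reference_values))
```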

## Visualizing molecules with PyMol

We also provide a guide for visualizing molecules with PyMol. The guide is borrowed from the [[github page]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional) of the previous work ConfGF.

### Start Setup

1. `pymol -R`
2. `Display - Background - White`
3. `Display - Color Space - CMYK`
4. `Display - Quality - Maximal Quality`
5. `Display Grid`
   1. by object: use `set grid_slot, int, mol_name` to put the molecule into the corresponding slot
   2. by state: align all conformations in a single slot
   3. by object-state: align all conformations and put them in separate slots. (`grid_slot` does not work here!)
6. `Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5`
7. `Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0`

### Show Molecule

1. To show molecules

   1. `hide everything`
   2. `show sticks`

2. To align molecules: `align name1, name2`

3. Convert RDKit mol to Pymol

   ```python
   from rdkit import Chem
   from rdkit.Chem import PyMol

   # Connect to a running PyMol server (started with `pymol -R`).
   v = PyMol.MolViewer()
   rdmol = Chem.MolFromSmiles('C')
   v.ShowMol(rdmol, name='mol')
   v.SaveFile('mol.pkl')
   ```


## Citation
Please consider citing our paper if you find it helpful. Thank you!
```
@inproceedings{
xu2022geodiff,
title={GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation},
author={Minkai Xu and Lantao Yu and Yang Song and Chence Shi and Stefano Ermon and Jian Tang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=PzcvxEMzvQC}
}
```

## Acknowledgement

This repo is built upon the previous work ConfGF's [[codebase]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional). Thanks Chence and Shitong!

## Contact

If you have any questions, please contact me at [email protected] or [email protected].

## Known issues

1. The current codebase is not compatible with more recent torch-geometric versions.
2. The current processed dataset (with PyG data object) is not compatible with more recent torch-geometric versions.
Binary file added assets/exp_drugs.png
Binary file added assets/geodiff_framework.png
38 changes: 38 additions & 0 deletions configs/drugs_1k_default.yml
@@ -0,0 +1,38 @@
model:
  type: diffusion # dsm and diffusion
  network: dualenc
  hidden_dim: 128
  num_convs: 6
  num_convs_local: 4
  cutoff: 10.0
  mlp_act: relu
  beta_schedule: sigmoid
  beta_start: 1.e-7
  beta_end: 9.e-3
  num_diffusion_timesteps: 1000
  edge_order: 3
  edge_encoder: mlp
  smooth_conv: true

train:
  seed: 2021
  batch_size: 32
  val_freq: 5000
  max_iters: 10000000
  max_grad_norm: 30000.0 # Different from QM9
  anneal_power: 2.0
  optimizer:
    type: adam
    lr: 1.e-3
    weight_decay: 0.
    beta1: 0.95
    beta2: 0.999
  scheduler:
    type: plateau
    factor: 0.6
    patience: 10

dataset:
  train: ./data/GEOM/Drugs/train_data_40k.pkl
  val: ./data/GEOM/Drugs/val_data_5k.pkl
  test: ./data/GEOM/Drugs/test_data_1k.pkl
38 changes: 38 additions & 0 deletions configs/drugs_default.yml
@@ -0,0 +1,38 @@
model:
  type: diffusion # dsm and diffusion
  network: dualenc
  hidden_dim: 128
  num_convs: 6
  num_convs_local: 4
  cutoff: 10.0
  mlp_act: relu
  beta_schedule: sigmoid
  beta_start: 1.e-7
  beta_end: 2.e-3
  num_diffusion_timesteps: 5000
  edge_order: 3
  edge_encoder: mlp
  smooth_conv: true

train:
  seed: 2021
  batch_size: 32
  val_freq: 5000
  max_iters: 10000000
  max_grad_norm: 30000.0 # Different from QM9
  anneal_power: 2.0
  optimizer:
    type: adam
    lr: 1.e-3
    weight_decay: 0.
    beta1: 0.95
    beta2: 0.999
  scheduler:
    type: plateau
    factor: 0.6
    patience: 10

dataset:
  train: ./data/GEOM/Drugs/train_data_40k.pkl
  val: ./data/GEOM/Drugs/val_data_5k.pkl
  test: ./data/GEOM/Drugs/test_data_1k.pkl
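
The two Drugs configs differ only in `beta_end` and `num_diffusion_timesteps`. The sketch below shows one common way a `sigmoid` beta schedule is built in DDPM-style code: a logistic curve evaluated on a linear grid and rescaled to the range from `beta_start` to `beta_end`. The grid range of ±6 and both function names are assumptions made for illustration; the schedule code in this repository may differ.

```python
import math


def sigmoid_beta_schedule(beta_start, beta_end, num_timesteps, span=6.0):
    """Betas following a logistic curve between beta_start and beta_end.

    Assumption: the logistic input runs linearly from -span to +span.
    """
    betas = []
    for t in range(num_timesteps):
        x = -span + 2.0 * span * t / max(num_timesteps - 1, 1)
        sigmoid = 1.0 / (1.0 + math.exp(-x))
        betas.append(beta_start + (beta_end - beta_start) * sigmoid)
    return betas


def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal left at step t."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


# Values taken from configs/drugs_default.yml and configs/drugs_1k_default.yml.
drugs_default = sigmoid_beta_schedule(1e-7, 2e-3, 5000)
drugs_1k = sigmoid_beta_schedule(1e-7, 9e-3, 1000)
print("final alpha_bar (5000 steps):", alpha_bar(drugs_default)[-1])
print("final alpha_bar (1000 steps):", alpha_bar(drugs_1k)[-1])
```

Under these assumptions, the shorter schedule uses a larger `beta_end` so that the data is still almost fully noised by the last step despite having fewer timesteps.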