init

MinkaiXu · Mar 25, 2022 · d76991d
1 parent 2595457
commit d76991d
Showing 29 changed files with 4,317 additions and 4 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,134 @@
+# script
+*.sh
+# *.ipynb
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
diff --git a/README.md b/README.md
@@ -1,8 +1,154 @@
-# GeoDiff
+# GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation
 
-[[OpenReview]](https://openreview.net/forum?id=PzcvxEMzvQC)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/MinkaiXu/GeoDiff/blob/main/LICENSE)
 
+[[OpenReview](https://openreview.net/forum?id=PzcvxEMzvQC)] [[arXiv](https://arxiv.org/abs/2203.02923)] [[Code](https://github.com/MinkaiXu/GeoDiff)]
 
-The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 **Oral Presentation**)
+The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 **Oral Presentation [54/3391]**).
 
-The code is coming soon. We have a primary version on [OpenReview](https://openreview.net/forum?id=PzcvxEMzvQC) as the supplementary material, and the link is also copied [here](https://openreview.net/attachment?id=PzcvxEMzvQC&name=supplementary_material).
+![cover](assets/geodiff_framework.png)
+
+## Environments
+
+### Install via Conda (Recommended)
+
+```bash
+# Clone the environment
+conda env create -f env.yml
+# Activate the environment
+conda activate geodiff
+# Install PyG
+conda install pytorch-geometric=1.7.2=py37_torch_1.8.0_cu102 -c rusty1s -c conda-forge
+```
+
+## Dataset
+
+### Official Dataset
+The official raw GEOM dataset is available [[here]](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF).
+
+### Preprocessed dataset
+We provide the preprocessed datasets (GEOM) in this [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). After downloading the dataset, put it in the folder path specified by the `dataset` variable of the config files `./configs/*.yml`.
+
+### Prepare your own GEOM dataset from scratch (optional)
+
+You can also download the original full GEOM dataset and prepare your own data split. A guide is available on the [[github page]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional) of the previous work ConfGF.
+
+## Training
+
+All hyper-parameters and training details are provided in the config files (`./configs/*.yml`). Feel free to tune these parameters.
+
+You can train the model with the following commands:
+
+```bash
+# Default settings
+python train.py ./configs/qm9_default.yml
+python train.py ./configs/drugs_default.yml
+# An ablation setting with fewer timesteps, as described in Appendix D.2.
+python train.py ./configs/drugs_1k_default.yml
+```
+
+The model checkpoints, the configuration YAML file, and the training log are saved to the directory specified by the `--logdir` argument of `train.py`.
+
+## Generation
+
+We provide the checkpoints of two trained models, `qm9_default` and `drugs_default`, in the [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). Put the checkpoints `*.pt` into paths like `${log}/${model}/checkpoints/`, and put the corresponding configuration file `*.yml` into the parent directory `${log}/${model}/`.
+
+You can generate conformations for all or part of the test set with:
+
+```bash
+python test.py ${log}/${model}/checkpoints/${iter}.pt \
+ --start_idx 800 --end_idx 1000
+```
+Here `start_idx` and `end_idx` indicate the range of the test set to use. All sampling hyper-parameters can be set in `test.py`. For the qm9 model, you can add the argument `--w_global 0.3`, which empirically gives slightly better results.
+
+Conformations of some drug-like molecules generated by GeoDiff are provided below.
+
+<p align="center">
+ <img src="assets/exp_drugs.png" /> 
+</p>
+
+## Evaluation
+
+After generating conformations with the above commands, the results of all benchmark tasks can be calculated from the generated data.
+
+### Task 1. Conformation Generation
+
+The `COV` and `MAT` scores on the GEOM datasets can be calculated using the following commands:
+
+```bash
+python eval_covmat.py ${log}/${model}/${sample}/sample_all.pkl
+```
+
+
+### Task 2. Property Prediction
+
+For property prediction, we use a small split of qm9 that differs from the one used in the `Conformation Generation` task. This split is also provided in the [[google drive folder]](https://drive.google.com/drive/folders/1b0kNBtck9VNrLRZxg6mckyVUpJA5rBHh?usp=sharing). You can generate conformations and evaluate the `mean absolute error (MAE)` metric on this split with the following commands:
+
+```bash
+python test.py ${log}/${model}/checkpoints/${iter}.pt --num_confs 50 \
+ --start_idx 0 --test_set data/GEOM/QM9/qm9_property.pkl
+python eval_prop.py --generated ${log}/${model}/${sample}/sample_all.pkl
+```
+
+## Visualizing molecules with PyMol
+
+Here we also provide a guideline for visualizing molecules with PyMol. The guideline is borrowed from previous work ConfGF's [[github page]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional).
+
+### Start Setup
+
+1. `pymol -R`
+2. `Display - Background - White`
+3. `Display - Color Space - CMYK`
+4. `Display - Quality - Maximal Quality`
+5. `Display Grid`
+ 1. by object: use `set grid_slot, int, mol_name` to put the molecule into the corresponding slot
+ 2. by state: align all conformations in a single slot
+ 3. by object-state: align all conformations and put them in separate slots. (`grid_slot` does not work!)
+6. `Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5`
+7. `Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0`
+
+### Show Molecule
+
+1. To show molecules
+
+ 1. `hide everything`
+ 2. `show sticks`
+
+2. To align molecules: `align name1, name2`
+
+3. Convert RDKit mol to Pymol
+
+ ```python
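+ from rdkit import Chem
+ from rdkit.Chem import PyMol
+
+ # Connects to a running PyMOL server (started with `pymol -R`)
+ v = PyMol.MolViewer()
+ rdmol = Chem.MolFromSmiles('C')
+ # Send the molecule to PyMOL under the object name 'mol'
+ v.ShowMol(rdmol, name='mol')
+ v.SaveFile('mol.pkl')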
+ ```
+
+
+## Citation
+Please consider citing our paper if you find it helpful. Thank you!
+```
+@inproceedings{
+xu2022geodiff,
+title={GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation},
+author={Minkai Xu and Lantao Yu and Yang Song and Chence Shi and Stefano Ermon and Jian Tang},
+booktitle={International Conference on Learning Representations},
+year={2022},
+url={https://openreview.net/forum?id=PzcvxEMzvQC}
+}
+```
+
+## Acknowledgement
+
+This repo is built upon the previous work ConfGF's [[codebase]](https://github.com/DeepGraphLearning/ConfGF#prepare-your-own-geom-dataset-from-scratch-optional). Thanks Chence and Shitong!
+
+## Contact
+
+If you have any questions, please contact me at [email protected] or [email protected].
+
+## Known issues
+
+1. The current codebase is not compatible with more recent torch-geometric versions.
+2. The current processed dataset (with PyG data object) is not compatible with more recent torch-geometric versions.
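For readers of the README's Evaluation section, here is a minimal sketch of how recall-style `COV` and `MAT` scores are commonly defined. It assumes a precomputed matrix where `rmsd[i][j]` is the RMSD between reference conformer `i` and generated conformer `j`. The function name `cov_mat`, the toy matrix, and the threshold value are illustrative assumptions; `eval_covmat.py` is not shown in this excerpt and may differ in details such as alignment and how the threshold comparison is made.

```python
def cov_mat(rmsd, threshold):
    """Return (COV, MAT) for one molecule from a reference-by-generated RMSD matrix.

    Assumed definitions (recall-style):
      COV = fraction of reference conformers whose closest generated
            conformer is within `threshold`
      MAT = mean, over reference conformers, of the RMSD to the closest
            generated conformer
    """
    if not rmsd or not rmsd[0]:
        raise ValueError("rmsd must be a non-empty matrix")
    # Closest generated conformer for each reference conformer
    best = [min(row) for row in rmsd]
    cov = sum(1 for d in best if d <= threshold) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


# Toy example: 3 reference conformers, 2 generated conformers (made-up values)
toy_rmsd = [
    [0.3, 0.9],
    [1.4, 0.6],
    [2.0, 1.8],
]
cov, mat = cov_mat(toy_rmsd, threshold=1.25)  # threshold is illustrative
print(f"COV={cov:.3f}  MAT={mat:.3f}")
```

Higher `COV` and lower `MAT` indicate that the generated set covers the reference conformers more closely.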
diff --git a/assets/exp_drugs.png b/assets/exp_drugs.png
diff --git a/assets/geodiff_framework.png b/assets/geodiff_framework.png
diff --git a/configs/drugs_1k_default.yml b/configs/drugs_1k_default.yml
@@ -0,0 +1,38 @@
+model:
+ type: diffusion # dsm and diffusion
+ network: dualenc
+ hidden_dim: 128
+ num_convs: 6
+ num_convs_local: 4
+ cutoff: 10.0
+ mlp_act: relu
+ beta_schedule: sigmoid
+ beta_start: 1.e-7
+ beta_end: 9.e-3
+ num_diffusion_timesteps: 1000
+ edge_order: 3
+ edge_encoder: mlp
+ smooth_conv: true
+
+train:
+ seed: 2021
+ batch_size: 32
+ val_freq: 5000
+ max_iters: 10000000
+ max_grad_norm: 30000.0 # Different from QM9
+ anneal_power: 2.0
+ optimizer:
+ type: adam
+ lr: 1.e-3
+ weight_decay: 0.
+ beta1: 0.95
+ beta2: 0.999
+ scheduler:
+ type: plateau
+ factor: 0.6
+ patience: 10
+
+dataset:
+ train: ./data/GEOM/Drugs/train_data_40k.pkl
+ val: ./data/GEOM/Drugs/val_data_5k.pkl
+ test: ./data/GEOM/Drugs/test_data_1k.pkl
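The `scheduler` block in this config (`type: plateau`, `factor: 0.6`, `patience: 10`) reads like a reduce-on-plateau rule. Below is a simplified sketch of such a rule, assuming the usual semantics where the learning rate is multiplied by `factor` once the validation loss has failed to improve for more than `patience` consecutive checks. The class `PlateauSketch` is hypothetical and omits the improvement threshold, cooldown, and minimum learning rate that full implementations usually provide; the repository's own scheduler code is not shown in this excerpt.

```python
class PlateauSketch:
    """Simplified reduce-on-plateau rule (assumption: lower loss is better)."""

    def __init__(self, lr, factor, patience):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0  # consecutive validation checks without improvement

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        if self.bad_steps > self.patience:
            self.lr *= self.factor
            self.bad_steps = 0
        return self.lr


# Values taken from the config: lr 1.e-3, factor 0.6, patience 10
sched = PlateauSketch(lr=1e-3, factor=0.6, patience=10)
sched.step(1.0)  # first validation check sets the best loss
lrs = [sched.step(1.0) for _ in range(11)]  # 11 checks with no improvement
print(lrs[9], lrs[10])
```

With `val_freq: 5000`, each call to `step` would presumably correspond to one validation pass every 5000 training iterations.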
diff --git a/configs/drugs_default.yml b/configs/drugs_default.yml
@@ -0,0 +1,38 @@
+model:
+ type: diffusion # dsm and diffusion
+ network: dualenc
+ hidden_dim: 128
+ num_convs: 6
+ num_convs_local: 4
+ cutoff: 10.0
+ mlp_act: relu
+ beta_schedule: sigmoid
+ beta_start: 1.e-7
+ beta_end: 2.e-3
+ num_diffusion_timesteps: 5000
+ edge_order: 3
+ edge_encoder: mlp
+ smooth_conv: true
+
+train:
+ seed: 2021
+ batch_size: 32
+ val_freq: 5000
+ max_iters: 10000000
+ max_grad_norm: 30000.0 # Different from QM9
+ anneal_power: 2.0
+ optimizer:
+ type: adam
+ lr: 1.e-3
+ weight_decay: 0.
+ beta1: 0.95
+ beta2: 0.999
+ scheduler:
+ type: plateau
+ factor: 0.6
+ patience: 10
+
+dataset:
+ train: ./data/GEOM/Drugs/train_data_40k.pkl
+ val: ./data/GEOM/Drugs/val_data_5k.pkl
+ test: ./data/GEOM/Drugs/test_data_1k.pkl
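The two Drugs configs differ only in `beta_end` and `num_diffusion_timesteps` (9.e-3 with 1000 steps versus 2.e-3 with 5000 steps). A minimal sketch of what a `sigmoid` beta schedule could look like, to compare the two settings. It assumes the sigmoid is evaluated on evenly spaced inputs in `[-6, 6]` and rescaled to lie between `beta_start` and `beta_end`; that input range and the helper names `sigmoid_beta_schedule` and `final_alpha_bar` are assumptions, not taken from this diff.

```python
import math


def sigmoid_beta_schedule(beta_start, beta_end, num_timesteps):
    """Betas following a sigmoid ramp between beta_start and beta_end.

    Assumption: the sigmoid input runs linearly over [-6, 6].
    """
    if num_timesteps < 2:
        raise ValueError("num_timesteps must be at least 2")
    betas = []
    for i in range(num_timesteps):
        x = -6.0 + 12.0 * i / (num_timesteps - 1)
        s = 1.0 / (1.0 + math.exp(-x))
        betas.append(beta_start + s * (beta_end - beta_start))
    return betas


def final_alpha_bar(betas):
    """Cumulative product of (1 - beta_t): the signal fraction left at the last step."""
    return math.exp(sum(math.log1p(-b) for b in betas))


# Values from configs/drugs_default.yml and configs/drugs_1k_default.yml
drugs_default_betas = sigmoid_beta_schedule(1e-7, 2e-3, 5000)
drugs_1k_betas = sigmoid_beta_schedule(1e-7, 9e-3, 1000)

print("drugs_default:", final_alpha_bar(drugs_default_betas))
print("drugs_1k     :", final_alpha_bar(drugs_1k_betas))
```

Under the assumed `[-6, 6]` range, both settings end with a remaining signal fraction on the order of one percent, which is consistent with the 1000-step ablation using a larger `beta_end` to compensate for fewer steps.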