initial commit

HankerWu · Aug 13, 2022 · 69ead98
1 parent 56cb4d9
commit 69ead98
Showing 342 changed files with 81,178 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,125 @@
+# JetBrains PyCharm IDE
+.idea/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# macOS dir files
+.DS_Store
+
+# Distribution / packaging
+.Python
+env/
+build/
+code_for_debug/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# dotenv
+.env
+
+# virtualenv
+.venv
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+
+# Generated files
+checkpoints/
+
+# VS Code project files
+/.vscode/
+
+# model
+/gpt_model/checkpoint_best.pt
+/model/checkpoint_best.pt
+
+# output
+/output/
+
+nohup.out
+.ipynb
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) Facebook, Inc. and its affiliates.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,4 +1,109 @@
 # TamGent
 Tailoring Molecules for Protein Pockets: a Transformer-based Generative Solution for Structured-based Drug Design
 
-Code and data to be released...
+# Introduction
+
+Code base: [fairseq-v0.8.0](https://github.com/facebookresearch/fairseq)
+
+Fairseq(-py) is a sequence modeling toolkit that allows researchers and
+developers to train custom models for translation, summarization, language
+modeling and other text generation tasks.
+
+# Installation
+
+```bash
+git clone https://github.com/HankerWu/TamGent.git
+cd TamGent
+git checkout main
+
+conda create -n TamGent python=3.7
+conda activate TamGent
+conda install rdkit -c conda-forge -y
+python -m pip install -e .[chem]
+```
+
+# Dataset
+
+The dataset is available at [data](https://microsoftapc-my.sharepoint.com/:f:/g/personal/v-kehanwu_microsoft_com/EmcBPtAwq1JNvgWCRkTsVzwB3vKWh12GXucGA8wtZL0Lnw?e=y7DYRn).
+
+## Build customized dataset
+
+You can build your customized dataset through the following methods:
+
+1. Build a customized dataset from PDB IDs. The script automatically finds the binding sites according to the ligands in the structure file.
+
+ ```bash
+ python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
+ ```
+
+ `PDB_ID_LIST` format: CSV format with columns ([] means optional):
+
+ `pdb_id,[ligand_inchi,uniprot_id]`
+2. Build a customized dataset from PDB IDs, using the center coordinates of each structure's binding site.
+
+ ```bash
+ python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
+ ```
+
+ `PDB_ID_LIST` format: CSV format with columns ([] means optional):
+
+ `pdb_id, center_x, center_y, center_z, [uniprot_id]`
+3. Build a customized dataset from a PDB ID list, using the residue IDs (indexes) of each structure's binding site.
+
+ ```bash
+ python scripts/build_data/prepare_pdb_ids_res_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} --res-ids-fn ${RES_IDS_FN}
+ ```
+
+ `PDB_ID_LIST` format: CSV format with columns ([] means optional):
+
+ `pdb_id,[uniprot_id]`
+
+ `RES_IDS_FN`: path to the residue IDs file, which holds a dict like:
+
+ ```python
+ {
+     0: {
+         chain_id_A: Array[res_id_A1, res_id_A2, ...],
+         chain_id_B: Array[res_id_B1, res_id_B2, ...],
+         ...
+     },
+     1: {
+         ...
+     },
+     ...
+ }
+ ```
+
+ The dict is stored as a pickle file, with entries in the same order as the rows of `PDB_ID_LIST`.
+
+ To use your own PDB structure files, put them in the `--pdb-path` folder and list their filenames in the `pdb_id` column of the `PDB_ID_LIST` CSV file.
+
+# Model
+
+The pretrained model is available at [model](https://microsoftapc-my.sharepoint.com/:f:/g/personal/v-kehanwu_microsoft_com/EipAXgQfu6lPm1y2OP1ZUyEBsqQbPZ7aukhJ8_hgUej0yw?e=6XoImh).
+
+# Run scripts
+
+```bash
+# train a new model
+bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}
+
+# generate molecules
+bash scripts/generate.sh -b ${BEAM_SIZE} -s ${SEED} -D ${DATA_PATH} --dataset ${TESTSET_NAME} --ckpt ${MODEL_PATH} --savedir ${OUTPUT_PATH}
+
+```
+
+# Citation
+
+Please cite as:
+
+```bibtex
+@inproceedings{TamGent,
+ title = {Tailoring Molecules for Protein Pockets: A Transformer-based Generative Solution for Structured-based Drug Design},
+ author = {Kehan Wu and Yingce Xia and Yang Fan and Pan Deng and Lijun Wu and Shufang Xie and Tong Wang and Haiguang Liu and Tao Qin and Tie-Yan Liu},
+ year = {2022},
+}
+```
+
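The README's third dataset option takes a `PDB_ID_LIST` CSV and a pickled `RES_IDS_FN` dict. The sketch below shows one way the two input files might be produced together so that their order stays aligned. Everything in it is an assumption for illustration rather than part of this commit: the helper name `write_inputs`, the file names `pdb_ids.csv` and `res_ids.pkl`, the header row in the CSV, the placeholder PDB IDs, and the use of plain lists where the README says `Array` (if the script expects NumPy arrays, wrap the lists accordingly).

```python
import csv
import pickle
import tempfile
from pathlib import Path


def write_inputs(out_dir, entries):
    """Write a PDB ID CSV and a matching residue-ID pickle (hypothetical helper).

    entries: list of (pdb_id, {chain_id: [res_id, ...]}) in the desired order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "pdb_ids.csv"   # assumed file name
    pkl_path = out_dir / "res_ids.pkl"   # assumed file name

    # One row per structure; the header row is an assumption based on the
    # README's `pdb_id` column name.
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["pdb_id"])
        for pdb_id, _ in entries:
            writer.writerow([pdb_id])

    # Integer keys follow the CSV row order, as the README's dict layout suggests.
    res_ids = {
        i: {chain: list(ids) for chain, ids in chains.items()}
        for i, (_, chains) in enumerate(entries)
    }
    with pkl_path.open("wb") as f:
        pickle.dump(res_ids, f)
    return csv_path, pkl_path


# Placeholder IDs and residue numbers, for illustration only.
entries = [
    ("1abc", {"A": [10, 11, 12], "B": [45, 46]}),
    ("2xyz", {"A": [101, 105]}),
]
csv_path, pkl_path = write_inputs(tempfile.mkdtemp(), entries)
```

The resulting paths would then be passed as `${PDB_ID_LIST}` and `--res-ids-fn ${RES_IDS_FN}` to `scripts/build_data/prepare_pdb_ids_res_ids.py`. Building both files from a single list keeps the pickle keys and the CSV rows from drifting out of order.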