TamGent

Tailoring Molecules for Protein Pockets: a Transformer-based Generative Solution for Structure-based Drug Design

Introduction

TamGent is built on Fairseq(-py), a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.

Installation

git clone https://github.com/HankerWu/TamGent.git
cd TamGent
git checkout main

conda create -n TamGent python=3.7 -y
conda activate TamGent
conda install rdkit -c conda-forge -y
python -m pip install -e .[chem]
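
After the commands above finish, a quick import check can confirm that the environment is usable. The following is a minimal sketch, not part of the repository; it assumes that the editable install exposes a `fairseq` module and that the conda step provides `rdkit`. The helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: these are the two modules the installation steps should provide.
REQUIRED = ["fairseq", "rdkit"]
```

Running `print(missing_modules(REQUIRED))` inside the activated `TamGent` environment should print an empty list if both packages were installed.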

Dataset

The dataset is available at data.

Build customized dataset

You can build your customized dataset through the following methods:

Build a customized dataset from PDB IDs. The script automatically finds the binding sites according to the ligands in the structure file.
```
python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
```
PDB_ID_LIST format: a CSV file with the following columns ([] means optional):

pdb_id,[ligand_inchi,uniprot_id]

Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry.
```
python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
```
PDB_ID_LIST format: a CSV file with the following columns ([] means optional):

pdb_id, center_x, center_y, center_z, [uniprot_id]

Build a customized dataset from PDB IDs, using the residue IDs (indexes) of the binding site of each PDB entry.
```
python scripts/build_data/prepare_pdb_ids_res_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} --res-ids-fn ${RES_IDS_FN}
```
PDB_ID_LIST format: a CSV file with the following columns ([] means optional):

pdb_id,[uniprot_id]

RES_IDS_FN format: the name of the residue IDs file, which contains a dict of the form:
```
{
  0:
    {
      chain_id_A: Array[res_id_A1, res_id_A2, ...],
      chain_id_B: Array[res_id_B1, res_id_B2, ...],
      ...
    },
  1:
    {
      ...
    },
  ...
}  
```
The dict is stored as a pickle file. Its keys follow the same order as the entries in PDB_ID_LIST.
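
As an illustration of the residue IDs file described above, here is a minimal sketch that pickles a dict keyed by row index, with one `{chain_id: residue ids}` mapping per PDB_ID_LIST row. The helper name `write_res_ids` is hypothetical. The use of NumPy arrays for the `Array[...]` values is an assumption; the sketch falls back to plain lists when NumPy is not installed.

```python
import pickle

try:
    import numpy as np
    _to_array = np.asarray  # assumption: Array[...] in the format means a NumPy array
except ImportError:
    _to_array = list  # fallback so the sketch still runs without NumPy


def write_res_ids(entries, path):
    """Pickle binding-site residue ids, one entry per row of PDB_ID_LIST.

    `entries` is a list, in PDB_ID_LIST row order, of {chain_id: [res_id, ...]} dicts.
    """
    res_ids = {
        row_index: {chain: _to_array(ids) for chain, ids in chains.items()}
        for row_index, chains in enumerate(entries)
    }
    with open(path, "wb") as f:
        pickle.dump(res_ids, f)
    return res_ids
```

For example, `write_res_ids([{"A": [45, 46, 49]}, {"A": [101], "B": [12, 13]}], "res_ids.pkl")` describes two PDB entries (the residue numbers are placeholders), and the resulting file name would then be passed as `--res-ids-fn res_ids.pkl`.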

To use customized PDB structure files, put them in the folder given by --pdb-path, and list their filenames in the pdb_id column of the PDB_ID_LIST CSV file.
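
The PDB_ID_LIST layouts described above can be written with Python's standard `csv` module. The sketch below is illustrative only: the helper name `write_pdb_id_list` and the layout keys are hypothetical, and it assumes that the scripts expect a header row and accept empty values for optional columns. Neither assumption is confirmed by this README.

```python
import csv

# Column layouts copied from the three PDB_ID_LIST descriptions.
# Optional columns are included and left empty when no value is given (assumption).
LAYOUTS = {
    "ligand": ["pdb_id", "ligand_inchi", "uniprot_id"],
    "center": ["pdb_id", "center_x", "center_y", "center_z", "uniprot_id"],
    "res_ids": ["pdb_id", "uniprot_id"],
}


def write_pdb_id_list(path, layout, rows):
    """Write `rows` (dicts keyed by column name) as a CSV file with a header row."""
    columns = LAYOUTS[layout]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

For instance, `write_pdb_id_list("pdb_ids.csv", "center", [{"pdb_id": "4hhb", "center_x": 1.5, "center_y": -2.0, "center_z": 3.25}])` produces a two-line file for the center-coordinate script; the coordinates here are placeholders, not a real binding site.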

Model

The pretrained model is available at model.

Run scripts

# train a new model
bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}

# generate molecules
bash scripts/generate.sh -b ${BEAM_SIZE} -s ${SEED} -D ${DATA_PATH} --dataset ${TESTSET_NAME} --ckpt ${MODEL_PATH} --savedir ${OUTPUT_PATH}
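
When generation is driven from Python, for example to sweep over several seeds, the command line can be assembled programmatically. This is a minimal sketch that uses only the flags shown in the command above; the function names `generate_command` and `to_shell` are hypothetical, and all paths in the usage note are placeholders.

```python
import shlex


def generate_command(beam_size, seed, data_path, testset_name, ckpt, savedir):
    """Assemble the argument list for scripts/generate.sh from its documented flags."""
    return [
        "bash", "scripts/generate.sh",
        "-b", str(beam_size),
        "-s", str(seed),
        "-D", data_path,
        "--dataset", testset_name,
        "--ckpt", ckpt,
        "--savedir", savedir,
    ]


def to_shell(argv):
    """Render an argument list as a shell-safe string (works on Python 3.7)."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

A call such as `subprocess.run(generate_command(20, 1, "data", "my_testset", "checkpoint.pt", "outputs"), check=True)`, executed from the repository root, would run one generation job; `to_shell` is only for logging the command.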

Citation

Please cite as:

@inproceedings{TamGent,
  title = {Tailoring Molecules for Protein Pockets: A Transformer-based Generative Solution for Structure-based Drug Design},
  author = {Kehan Wu and Yingce Xia and Yang Fan and Pan Deng and Lijun Wu and Shufang Xie and Tong Wang and Haiguang Liu and Tao Qin and Tie-Yan Liu},
  year = {2022},
}
