Magnushhoie/DiscoTope-3.0

Overview

DiscoTope-3.0 predicts epitopes on input protein structures, using inverse folding representations from the ESM-IF1 model. The tool accepts both solved and predicted structures in the PDB format, and outputs per-residue epitope propensity scores in a CSV format.

Webserver

To try DiscoTope-3.0 without installing it, use our DTU Healthtech webserver.

Repo contents

  • data: Example input files, including test set
  • discotope3: Source code
  • output: DiscoTope-3.0 output examples

Quickstart guide

# Setup environment and install
conda create --name inverse python=3.9 -y
conda activate inverse
conda install -c pyg pyg -y
conda install -c conda-forge pip -y

git clone https://github.com/Magnushhoie/discotope3_web/
cd discotope3_web/
pip install .

# Unzip models to use
unzip models.zip

# 1. Predict single PDB (solved structure)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
# CPU only:
python discotope3/main.py --cpu_only --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

Installation guide

We highly recommend using an Ubuntu OS and Conda (miniconda or anaconda) for installing required dependencies.

Predictions are faster using a GPU and the recommended versions of pytorch, pytorch-geometric and cudatoolkit, but these exact versions are not required.

For Linux & GPU with conda (recommended, ~2 mins)

# Setup environment with conda
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch=1.11 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip

# install pip dependencies
pip install .

Linux & GPU with pip (~5 mins)

# install pip dependencies
pip install -r requirements_recommended.txt
pip install .

Recommended system requirements

Running DiscoTope-3.0

DiscoTope-3.0 can run on a single PDB file, a folder or ZIP file of PDBs, or structures fetched by ID from RCSB or AlphaFoldDB.

On a common workstation with a GPU, prediction takes <1 second per PDB chain, plus ~15 seconds to load the required libraries and model weights.

Set the --struc_type parameter to 'solved' for experimentally solved structures (default) or 'alphafold' for modelled structures.

Note that DiscoTope-3.0 splits PDB structures into single chains before prediction, unless --multichain_mode is set.

# Unzip models
unzip models.zip

# Now select one of multiple options:

# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# 2. Predict AlphaFold structure
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_alphafold/7tdm_B.pdb --struc_type alphafold

# 3. Predict a folder of PDBs
python discotope3/main.py --pdb_dir data/example_pdbs_solved --out_dir output/example_pdbs_solved

# 4. Predict a ZIP file of PDBs
python discotope3/main.py --pdb_or_zip_file pdbs_in_zipfile.zip --out_dir output/pdbs_in_zipfile

# 5. Fetch PDBs from RCSB
python discotope3/main.py --list_file pdb_list_solved.txt --struc_type solved --out_dir output/pdb_list_solved

# 6. Fetch PDBs from AlphaFoldDB
python discotope3/main.py --list_file pdb_list_af2.txt --struc_type alphafold --out_dir output/pdb_list_af2

Predict B-cell epitope propensity on input protein PDB structures

optional arguments:
  -h, --help            show this help message and exit
  -f PDB_OR_ZIP_FILE, --pdb_or_zip_file PDB_OR_ZIP_FILE
                        Input file, either single PDB or compressed zip file with multiple PDBs
  --list_file LIST_FILE
                        File with PDB or Uniprot IDs, fetched from RCSB/AlphaFolddb
  --struc_type STRUC_TYPE
                        Structure type from file (solved | alphafold)
  --pdb_dir PDB_DIR     Directory with AF2 PDBs
  --out_dir OUT_DIR     Job output directory
  --models_dir MODELS_DIR
                        Path for .json files containing trained XGBoost ensemble
  --calibrated_score_epi_threshold CALIBRATED_SCORE_EPI_THRESHOLD
                        Calibrated-score threshold for epitopes [low 0.40, moderate (0.90), higher 1.50]
  --no_calibrated_normalization
                        Skip Calibrated-normalization of PDBs
  --check_existing_embeddings CHECK_EXISTING_EMBEDDINGS
                        Check for existing embeddings to load in pdb_dir
  --cpu_only            Use CPU even if GPU is available (default uses GPU if available)
  --max_gpu_pdb_length MAX_GPU_PDB_LENGTH
                        Maximum PDB length to embed on GPU (1000), otherwise CPU
  --multichain_mode     Predicts entire complexes, unsupported and untested
  --save_embeddings SAVE_EMBEDDINGS
                        Save embeddings to pdb_dir
  --web_server_mode     Flag for printing HTML output
  -v VERBOSE, --verbose VERBOSE
                        Verbose logging
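
For batch jobs, the options above can be assembled from Python. The following is a minimal sketch and not part of DiscoTope-3.0: the helper name build_discotope_command is hypothetical, and it uses only flags listed in the help text (--pdb_or_zip_file, --struc_type, --out_dir, --cpu_only). It assumes the command is run from the repository root so that discotope3/main.py resolves.

```python
import sys
from typing import List


def build_discotope_command(pdb_file: str, out_dir: str,
                            struc_type: str = "solved",
                            cpu_only: bool = False) -> List[str]:
    """Assemble a DiscoTope-3.0 command line (hypothetical helper)."""
    # The help text lists exactly these two structure types.
    if struc_type not in ("solved", "alphafold"):
        raise ValueError("struc_type must be 'solved' or 'alphafold'")
    cmd = [sys.executable, "discotope3/main.py",
           "--pdb_or_zip_file", pdb_file,
           "--struc_type", struc_type,
           "--out_dir", out_dir]
    if cpu_only:
        cmd.append("--cpu_only")
    return cmd


if __name__ == "__main__":
    # Output directory name is an arbitrary choice for illustration.
    print(" ".join(build_discotope_command(
        "data/example_pdbs_solved/7c4s.pdb", "output/7c4s")))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the prediction; building the arguments as a list avoids shell quoting problems with unusual file names.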

DiscoTope-3.0 output

DiscoTope-3.0 splits input PDBs into single-chain PDB files, then predicts per-residue epitope propensity scores. Outputs are saved in both PDB and CSV format.

The CSV output files contain per-residue outputs, with the following column headers:

  • PDB ID and chain name
  • Relative residue index (re-numbered from 1)
  • Amino-acid residue, 1-letter
  • DiscoTope-3.0 score (0.00 - 1.00)
  • Predicted epitope (True or False), based on calibrated_score_epi_threshold (default 0.90)
  • Relative surface accessibility (Shrake-Rupley, normalized using Sander scale)
  • AlphaFold pLDDT score (0-100, set to 100 for non-AlphaFold structures)
  • Chain length
  • A binary feature set to 0 for solved and 1 for AlphaFold structures.

The PDB output files contain individual single chains with the B-factor column replaced with per-residue DiscoTope-3.0 scores (2nd right-most column). Note that the scores are multiplied by 100 as PDB files only allow 2 decimals of precision.

Example input PDB (see 7c4s.pdb):

python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

Example output CSV (see 7c4s_A_discotope3.csv):

pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
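
As a sketch of how this CSV might be post-processed, the snippet below reads the rows with Python's standard csv module and lists residues at or above a score cutoff. The function names and the cutoff of 0.20 are illustrative assumptions; the cutoff is applied directly to the DiscoTope-3.0_score column and is not the tool's calibrated epitope threshold.

```python
import csv
import io
from typing import Dict, List


def read_discotope_csv(handle) -> List[Dict[str, object]]:
    """Parse DiscoTope-3.0 CSV rows, converting numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "pdb": row["pdb"],
            "res_id": int(row["res_id"]),
            "residue": row["residue"],
            "score": float(row["DiscoTope-3.0_score"]),
            "rsa": float(row["rsa"]),
        })
    return rows


def residues_above(rows: List[Dict[str, object]], cutoff: float):
    """Return rows whose score is >= cutoff (an illustrative cutoff)."""
    return [r for r in rows if r["score"] >= cutoff]


# The three example rows shown above, inlined so the sketch is self-contained.
example = io.StringIO(
    "pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag\n"
    "7c4s_A,14,G,0.15186,0.80634,100,282,0\n"
    "7c4s_A,15,Q,0.13953,0.45077,100,282,0\n"
    "7c4s_A,16,E,0.23955,0.72919,100,282,0\n"
)
rows = read_discotope_csv(example)
top = residues_above(rows, cutoff=0.20)
print([(r["res_id"], r["residue"]) for r in top])
```

For a real output file, replace the StringIO object with a file opened via open(path, newline="").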

Example output PDB (see 7c4s_A_discotope3.pdb): (Note DiscoTope-3.0 scores in the B-factor column)

ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N  
ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C  
ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C  
ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O  
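
To recover the per-residue scores from an output PDB, the B-factor field can be read back and divided by 100. The sketch below assumes the standard fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, B-factor in columns 61-66); the function name scores_from_pdb is hypothetical, and insertion codes are ignored for brevity.

```python
from typing import Dict, Iterable, Tuple


def scores_from_pdb(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Map (chain, residue number) to the score stored in the B-factor field."""
    scores: Dict[Tuple[str, int], float] = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]                # column 22
        res_num = int(line[22:26])      # columns 23-26
        b_factor = float(line[60:66])   # columns 61-66
        # Undo the x100 scaling applied when scores are written to the PDB.
        # All atoms of a residue carry the same value, so overwriting is safe.
        scores[(chain, res_num)] = b_factor / 100.0
    return scores
```

Applied to the four ATOM lines above, this gives a score of about 0.1519 for residue 14 of chain A, matching the CSV value 0.15186 to the two decimals the PDB format retains.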

Reproduce test-set predictions (AlphaFold2 structures)

# Unzip AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on PDB folder
python discotope3/main.py \
--pdb_dir data/test_set_af2 \
--struc_type alphafold \
--out_dir output/test_set_af2

Troubleshooting

  • "No valid amino-acid backbone found" - DiscoTope-3.0 only predicts epitopes on amino acids, not on non-amino-acid entities such as heteroatoms (e.g. water, or solvents like dimethyl sulfoxide). These chains should not be specified as input.
  • PDBConstructionWarning regarding discontinuous chains - Common issue with some PDB files (experimental structures only) missing co-ordinates for some atoms. As long as no backbone co-ordinates (C, Ca, N) are missing, it does not impact predictions.
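
To check a PDB file for the backbone issues described above before running predictions, something like the following sketch could be used. It is not DiscoTope-3.0's own validation: the function name missing_backbone is hypothetical, it assumes standard fixed-column ATOM records, and it skips HETATM records entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

BACKBONE = {"N", "CA", "C"}


def missing_backbone(lines: Iterable[str]) -> Dict[Tuple[str, int], List[str]]:
    """Return {(chain, residue number): [missing backbone atom names]}."""
    seen: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("ATOM"):
            key = (line[21], int(line[22:26]))   # chain ID, residue number
            seen[key].add(line[12:16].strip())   # atom name, columns 13-16
    # Keep only residues that lack at least one of N, CA, C.
    return {key: sorted(BACKBONE - atoms)
            for key, atoms in seen.items() if BACKBONE - atoms}
```

An empty result means every residue with ATOM records has all three backbone atoms. A file with no ATOM records at all also returns an empty result, so that case should be checked separately.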

Installation gcc or g++ errors, missing torch-scatter build ...

# Make sure gcc and g++ versions are updated, pybind11 is available
# torch-scatter should be listed with 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get
sudo apt-get install gcc g++
pip install pybind11

Citing this work

The code and data in this package are based on the DiscoTope-3.0 paper. If you use them, please cite:

@ARTICLE{discotope3,
        AUTHOR={Høie, Magnus Haraldson  and Gade, Frederik Steensgaard  and Johansen, Julie Maria  and Würtzen, Charlotte  and Winther, Ole  and Nielsen, Morten  and Marcatili, Paolo },
        TITLE={DiscoTope-3.0: improved B-cell epitope prediction using inverse folding latent representations},
        JOURNAL={Frontiers in Immunology},
        VOLUME={15},
        YEAR={2024},
        URL={https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1322712},
        DOI={10.3389/fimmu.2024.1322712},
        ISSN={1664-3224},
}

License

This source code is licensed under the Creative Commons license found in the LICENSE file in the root directory of this source tree.