Magnushhoie/DiscoTope-3.0

Overview

DiscoTope-3.0 is a structure-based B-cell epitope prediction tool that exploits inverse folding latent representations from the ESM-IF1 model. The tool accepts input protein structures in PDB format (solved or predicted) and outputs per-residue epitope propensity scores in both PDB and CSV formats.

DiscoTope-3.0 accepts both experimental and AlphaFold2-modelled structures, with similar performance for both. It has been trained and validated only on single-chain structures.

Repo contents

  • data: Example input files, including test set
  • discotope3: Source code
  • output: DiscoTope-3.0 output examples

Recommended system requirements

Quickstart guide

# Setup environment and install
conda create --name inverse python=3.9 -y
conda activate inverse
conda install -c pyg pyg -y
conda install -c conda-forge pip -y

git clone https://github.com/Magnushhoie/discotope3_web/
cd discotope3_web/
pip install .

# Unzip models to use
unzip models.zip

# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

Installation guide

We highly recommend using an Ubuntu OS and Conda (miniconda or anaconda) for installing required dependencies.

Predictions are faster using a GPU and the recommended versions of pytorch, pytorch-geometric and cudatoolkit, but these exact versions are not required.

For Linux with conda (recommended, ~2 mins)

# Setup environment with conda
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch=1.11 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip

# install pip dependencies
pip install .

Linux with pip (~5 mins)

# install pip dependencies
pip install -r requirements_recommended.txt
pip install .

Running DiscoTope-3.0

DiscoTope-3.0 can run predictions on a single PDB file, a folder or ZIP file of PDBs, or PDBs fetched by ID from RCSB or AlphaFoldDB.

On a common workstation with a GPU, predictions take <1 second per PDB chain, plus ~15 seconds to load the required libraries and model weights.

Set the --struc_type parameter to 'solved' for experimentally solved structures (default) or 'alphafold' for modelled structures.

Note that DiscoTope-3.0 splits PDB structures into single chains before prediction, unless --multi_chain_mode is set.

# Unzip models
unzip models.zip

# Now select one of multiple options:

# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# 2. Predict AlphaFold structure
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_alphafold/7tdm_B.pdb --struc_type alphafold

# 3. Predict a folder of PDBs
python discotope3/main.py --pdb_dir data/example_pdbs_solved

# 4. Predict a ZIP file of PDBs
python discotope3/main.py --pdb_or_zip_file pdbs_in_zipfile.zip

# 5. Fetch PDB IDs from file and predict
# NB: Fetches from RCSB if --struc_type is solved, and from AlphaFoldDB if alphafold
python discotope3/main.py --list_file pdb_list_solved.txt --struc_type solved
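
When several inputs need to be processed from a script, the command lines above can be assembled programmatically. The following is a minimal sketch, not part of DiscoTope-3.0 itself: `build_discotope_command` is a hypothetical helper, and it uses only flags that appear in this README (`--pdb_or_zip_file`, `--pdb_dir`, `--list_file`, `--struc_type`, `--out_dir`). It assumes the script is run from the repository root.

```python
import sys
from typing import List, Optional

# Input flags and structure types as documented in this README.
INPUT_KINDS = {"pdb_or_zip_file", "pdb_dir", "list_file"}
STRUC_TYPES = {"solved", "alphafold"}


def build_discotope_command(
    input_path: str,
    input_kind: str = "pdb_or_zip_file",
    struc_type: str = "solved",
    out_dir: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for discotope3/main.py."""
    if input_kind not in INPUT_KINDS:
        raise ValueError(f"input_kind must be one of {sorted(INPUT_KINDS)}")
    if struc_type not in STRUC_TYPES:
        raise ValueError(f"struc_type must be one of {sorted(STRUC_TYPES)}")
    cmd = [
        sys.executable,
        "discotope3/main.py",
        f"--{input_kind}",
        input_path,
        "--struc_type",
        struc_type,
    ]
    if out_dir is not None:
        cmd += ["--out_dir", out_dir]
    return cmd


cmd = build_discotope_command(
    "data/example_pdbs_alphafold/7tdm_B.pdb", struc_type="alphafold"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with unusual file names.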

DiscoTope-3.0 output

DiscoTope-3.0 splits input PDBs into single-chain PDB files, then predicts per-residue epitope propensity scores. Outputs are saved in both PDB and CSV format.

The CSV output files contain per-residue outputs, with the following column headers:

  • PDB ID and chain name
  • Relative residue index (re-numbered from 1)
  • Amino-acid residue, 1-letter
  • DiscoTope-3.0 score (0.00 - 1.00)
  • Predicted epitope (True or False), based on calibrated_score_epi_threshold (default 0.90)
  • Relative surface accessibility (Shrake-Rupley, normalized using Sander scale)
  • AlphaFold pLDDT score (0-100, set to 100 for non-AlphaFold structures)
  • Chain length
  • A binary feature set to 0 for solved and 1 for AlphaFold structures.

The PDB output files contain individual single chains with the B-factor column replaced with per-residue DiscoTope-3.0 scores (2nd right-most column). Note that the scores are multiplied by 100 as PDB files only allow 2 decimals of precision.

Example input PDB (see 7c4s.pdb):

python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

Example output CSV (see 7c4s_A_discotope3.csv):

pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
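
As a rough illustration of how the CSV output might be consumed, the sketch below reads the three example rows above with Python's standard `csv` module and ranks residues by score. The `rank_residues` helper and the `min_score` cutoff of 0.15 are illustrative assumptions only; the cutoff is not the tool's calibrated epitope threshold.

```python
import csv
import io

# The example rows shown above, embedded so the sketch is self-contained.
SAMPLE_CSV = """pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
"""


def rank_residues(handle, min_score=0.0):
    """Return (pdb, res_id, residue, score) tuples, highest score first."""
    rows = []
    for row in csv.DictReader(handle):
        score = float(row["DiscoTope-3.0_score"])
        if score >= min_score:
            rows.append((row["pdb"], int(row["res_id"]), row["residue"], score))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# min_score=0.15 is an arbitrary value chosen for this example.
ranked = rank_residues(io.StringIO(SAMPLE_CSV), min_score=0.15)
for pdb, res_id, residue, score in ranked:
    print(f"{pdb} {residue}{res_id} {score:.5f}")
```

To read a real output file, replace the `io.StringIO` object with `open("7c4s_A_discotope3.csv", newline="")`.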

Example output PDB (see 7c4s_A_discotope3.pdb): (Note DiscoTope-3.0 scores in the B-factor column)

ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N  
ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C  
ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C  
ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O  
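
Because the scores in the PDB output are stored as B-factors multiplied by 100, they need to be divided by 100 to get back to the 0.00 - 1.00 scale. A minimal sketch, assuming standard fixed-width PDB columns (chain ID in column 22, residue number in columns 23-26, B-factor in columns 61-66); `scores_from_pdb` is a hypothetical helper and insertion codes are ignored for brevity:

```python
# Two of the example output lines shown above.
SAMPLE_PDB = """\
ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N
ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C
"""


def scores_from_pdb(lines):
    """Map (chain, residue number) to the score stored in the B-factor column."""
    scores = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]                # column 22
        res_seq = int(line[22:26])      # columns 23-26
        b_factor = float(line[60:66])   # columns 61-66
        # Undo the x100 scaling applied when the score was written.
        scores[(chain, res_seq)] = round(b_factor / 100, 4)
    return scores


scores = scores_from_pdb(SAMPLE_PDB.splitlines())
print(scores)
```

For residue 14 this recovers 0.1519, which agrees with the CSV value 0.15186 up to the two-decimal precision of the PDB format.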

Reproduce test-set predictions (AlphaFold2 structures)

# Unzip AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on PDB folder
python discotope3/main.py \
--pdb_dir data/test_set_af2 \
--struc_type alphafold \
--out_dir output/test_set_af2
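
After a folder run like the one above, the per-chain CSV files can be combined for analysis. The sketch below is an assumption-laden example: it assumes output CSVs carry the `_discotope3.csv` suffix seen in the example file name, and it searches `--out_dir` recursively because the exact output layout is not described here. `merge_discotope_csvs` is a hypothetical helper, and the demo writes one example row to a temporary folder rather than reading real predictions.

```python
import csv
import tempfile
from pathlib import Path


def merge_discotope_csvs(out_dir):
    """Collect rows from every *_discotope3.csv file found under out_dir."""
    merged = []
    for csv_path in sorted(Path(out_dir).rglob("*_discotope3.csv")):
        with csv_path.open(newline="") as handle:
            merged.extend(csv.DictReader(handle))
    return merged


# Demo on a temporary folder holding one row copied from the example output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "7c4s_A_discotope3.csv"
    demo.write_text(
        "pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag\n"
        "7c4s_A,14,G,0.15186,0.80634,100,282,0\n"
    )
    rows = merge_discotope_csvs(tmp)

print(len(rows), rows[0]["pdb"], rows[0]["DiscoTope-3.0_score"])
```

For the test-set run, the call would be `merge_discotope_csvs("output/test_set_af2")`.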

Common issues

  • No valid amino-acid backbone found: Occurs if only heteroatoms (non-amino acid residues) are found in the extracted chain. DiscoTope-3.0 requires full amino-acid backbone C, Ca and N atoms.
  • PDBConstructionWarning regarding discontinuous chains: Indicates missing residue atoms in the input PDB file. May impact DiscoTope-3.0 performance (solved structures only)
  • Biopython future deprecation warning: Benign Biopython library warning, does not impact predictions
  • ESM regression weights missing warning: Benign fair-esm library warning, does not impact predictions
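
The backbone requirement in the first item can be checked before running predictions. The following is an independent pre-check sketch, not DiscoTope-3.0's own validation logic: `residues_with_full_backbone` is a hypothetical helper that counts residues in a chain whose ATOM records include N, CA and C, assuming standard fixed-width PDB columns.

```python
from collections import defaultdict

BACKBONE = {"N", "CA", "C"}

# The four example ATOM records for residue GLY 14 of chain A.
SAMPLE_LINES = [
    "ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N",
    "ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C",
    "ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C",
    "ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O",
]


def residues_with_full_backbone(pdb_lines, chain_id):
    """Return (residues with N, CA and C; total residues) for one chain."""
    atoms = defaultdict(set)
    for line in pdb_lines:
        # HETATM records are skipped: only ATOM records count as amino acids here.
        if line.startswith("ATOM") and line[21] == chain_id:
            residue_key = line[22:27]            # residue number + insertion code
            atoms[residue_key].add(line[12:16].strip())
    complete = sum(1 for names in atoms.values() if BACKBONE <= names)
    return complete, len(atoms)


print(residues_with_full_backbone(SAMPLE_LINES, "A"))
print(residues_with_full_backbone(SAMPLE_LINES, "B"))
```

A chain for which the first number is 0 contains no residue with a complete backbone in its ATOM records, which is the situation described by the "No valid amino-acid backbone found" message.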

Installation gcc or g++ errors, missing torch-scatter build ...

# Make sure gcc and g++ versions are updated, pybind11 is available
# torch-scatter should be listed with 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get
sudo apt-get install gcc g++
pip install pybind11

Citation

For usage of the package and associated manuscript, please cite according to the enclosed citation.bib.

License

This source code is licensed under the Creative Commons license found in the LICENSE file in the root directory of this source tree.