Commit

pip install, local esm dep
Magnushhoie committed Sep 14, 2023
1 parent a1b5b7e commit 13a7243
Showing 30 changed files with 4,276 additions and 255 deletions.
190 changes: 89 additions & 101 deletions README.md
@@ -23,45 +23,50 @@ DiscoTope-3.0 accepts both experimental and AlphaFold2 modeled structures, with

# Repo contents

- [data](./data): Test set and example antigen PDB files
- [src](./src): Source code
- [data](./data): Example input files, including test set
- [discotope3](./discotope3): Source code
- [output](./output): DiscoTope-3.0 output examples
- [help.py](./help.py): (Optional) interactive helper script for running DiscoTope-3.0

# System Requirements

## Hardware Requirements

At minimum, a single core and about 8 GB of RAM are needed. For optimal performance, we recommend the following specs:

- RAM: 16+ GB
- CPU: 4+ cores
- GPU is optional

## Software Requirements

We highly recommend using an Ubuntu OS and Conda ([miniconda](https://docs.conda.io/en/main/miniconda.html) or [anaconda](https://www.anaconda.com/products/distribution)) for installing required dependencies. Exact versions of pytorch 1.11, cudatoolkit 11.3 and pytorch-geometric, scatter and sparse are required.

### OS Requirements

The development version of the package has been tested on the following *Linux* system:

Linux: Ubuntu 18.04

### Python requirements
### Recommended system requirements
- GPU is optional. Recommended: 16 GB RAM and a CPU with 2+ cores.
- Linux operating system (e.g. Ubuntu 18.04); also works on macOS
- [Python 3.9](https://www.python.org/downloads/)
- [Pytorch 1.11](https://pytorch.org/get-started/locally/)
- [cudatoolkit 11.3](https://anaconda.org/anaconda/cudatoolkit)
- [Pytorch geometric 2.0.4](https://github.com/pyg-team/pytorch_geometric)
- [fair-esm 0.5](https://github.com/facebookresearch/esm)
- [Biopython](https://github.com/biopython/biopython)
- [Biotite](https://github.com/biotite-dev/biotite)
- [pandas](https://github.com/pandas-dev/pandas)
- [numpy](https://github.com/numpy/numpy)
- [py-xgboost-gpu](https://xgboost.readthedocs.io/en/stable/install.html)

# Quickstart guide

```bash
# Setup environment and install
conda create --name inverse3 python=3.9 -y
conda activate inverse3
conda install -c pyg pyg -y
conda install -c conda-forge pip -y

git clone https://github.com/Magnushhoie/discotope3_web/
cd discotope3_web/
pip install .

# Unzip models to use
unzip models.zip

# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```

# Installation guide

## Installing with conda (Linux) (recommended, ~2 mins)
We highly recommend using an Ubuntu OS and Conda ([miniconda](https://docs.conda.io/en/main/miniconda.html) or [anaconda](https://www.anaconda.com/products/distribution)) for installing required dependencies.

Predictions are faster using a GPU and the recommended versions of pytorch, pytorch-geometric and cudatoolkit, but these exact versions are not required.

## For Linux with conda (recommended, ~2 mins)

```bash
# Setup environment with conda
@@ -73,97 +78,58 @@ conda install pip

# install pip dependencies
pip install -r requirements.txt

# Unzip models
unzip models.zip
```

## Installing with pip only (Linux) (~5 mins)
## Linux with pip (~5 mins)
```bash
# install pip dependencies
pip install -r requirements_full.txt

# Unzip models
unzip models.zip
pip install -r requirements_recommended.txt
```

## Nb: gcc or g++ errors, missing torch-scatter build ...
## For MacOS
```bash
# Make sure gcc and g++ versions are updated, pybind11 is available
# torch-scatter should be listed with 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get
sudo apt-get install gcc g++
pip install pybind11
```

For GPU accelerated predictions, please install [py-xgboost-gpu](https://xgboost.readthedocs.io/en/stable/install.html) and make sure a GPU is available.
# Setup environment with conda
conda create -n inverse python=3.9
conda install pyg -c pyg -c conda-forge

```bash
conda install -c conda-forge py-xgboost-gpu
# install pip dependencies
pip install -r requirements.txt
```

# Demo
## Running DiscoTope-3.0

On a common workstation with a GPU, predictions take <1 second per PDB chain, plus ~15 seconds for loading the required libraries and model weights. Ensure XGBoost model weights are unzipped by first running 'unzip models.zip' (see [Installation Guide](#installation-guide)). ESM-IF1 weights are downloaded automatically the first time the prediction script is run (~1 min).
DiscoTope-3.0 can predict a single PDB, a folder or ZIP file of PDBs, or fetch PDBs by their IDs from RCSB or AlphaFoldDB and predict them.

## Predict a single PDB (solved structure)
On a common workstation with a GPU, predictions take <1 second per PDB chain, plus ~15 seconds for loading the required libraries and model weights.

```bash
# Run on single PDB on CPU only (by default checks for available GPU)
python src/predict_webserver.py \
--cpu_mode \
--pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb \
--struc_type solved \
--out_dir output/7c4s
```
Set the --struc_type parameter to 'solved' for experimentally solved structures (default) or 'alphafold' for modelled structures.

## Reproduce test-set predictions (AlphaFold2 structures)
Note that DiscoTope-3.0 splits PDB structures into single chains before prediction, unless --multi_chain_mode is set.

```bash
# Unzip AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on PDB folder
python src/predict_webserver.py \
--pdb_dir data/test_set_af2 \
--struc_type alphafold \
--out_dir output/test_set_af2
```
# Unzip models
unzip models.zip

## Running on own data (batch-mode)
# Now select one of multiple options:

Set the --struc_type parameter to 'solved' for experimentally solved structures or 'alphafold' for modelled structures.
# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

Note that DiscoTope-3.0 splits PDB structures into single chains before prediction.
# 2. Predict AlphaFold structure
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_alphafold/7tdm_B.pdb --struc_type alphafold

```bash
# Predict on example (solved) PDBs in data/example_pdbs_solved folder
python src/predict_webserver.py \
--pdb_dir data/example_pdbs_solved \
--struc_type solved \
--out_dir output/example_pdbs_solved

# Fetch & predict PDBs from list file from AlphaFoldDB
python src/predict_webserver.py \
--list_file data/af2_list_uniprot.txt \
--struc_type alphafold \
--out_dir output/af2_list_uniprot
# 3. Predict a folder of PDBs
python discotope3/main.py --pdb_dir data/example_pdbs_solved

# Fetch & predict PDBs from list file from RCSB
python src/predict_webserver.py \
--list_file data/solved_list_rcsb.txt \
--struc_type solved \
--out_dir output/solved_list_rcsb
# 4. Predict a ZIP file of PDBs
python discotope3/main.py --pdb_or_zip_file pdbs_in_zipfile.zip

# See more options
python automate.py
# 5. Fetch PDB IDs from file and predict
# Nb: Fetches from RCSB if struc_type is solved and from AlphaFoldDB if alphafold
python discotope3/main.py --list_file pdb_list_solved.txt --struc_type solved
```

# Results
# DiscoTope-3.0 output

DiscoTope-3.0 splits input PDBs into single-chain PDB files, then predicts per-residue epitope propensity scores.
Outputs are saved in both PDB and CSV format.
@@ -172,20 +138,18 @@ The CSV output files contains per-residue outputs, with the following column hea
- PDB ID and chain name
- Relative residue index (re-numbered from 1)
- Amino-acid residue, 1-letter
- DiscoTope-3.0 score (theoretical range 0.00 - 1.00)
- DiscoTope-3.0 score (0.00 - 1.00)
- Predicted epitope (True or False), based on calibrated_score_epi_threshold (default 0.90)
- Relative surface accessibility (Shrake-Rupley, normalized using Sander scale)
- AlphaFold pLDDT score (0-100, set to 100 for non-AlphaFold structures)
- Chain length
- A binary feature set to 1 for AlphaFold structures.
- A binary feature set to 0 for solved and 1 for AlphaFold structures.
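
As a rough illustration of working with the CSV columns listed above, the sketch below reads a CSV and keeps residues whose score reaches a cut-off. It assumes the column names from the example header in this README (`res_id`, `residue`, `DiscoTope-3.0_score`); the function name, the `>=` comparison and the demo rows are hypothetical, and the cut-off applied here to the raw score column may not be identical to the tool's own `calibrated_score_epi_threshold` logic.

```python
import csv
import io

# Column name assumed from the example CSV header shown in this README.
SCORE_COLUMN = "DiscoTope-3.0_score"


def residues_above_threshold(csv_text, threshold=0.90):
    """Return (res_id, residue, score) for rows scoring at or above threshold."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row[SCORE_COLUMN])
        if score >= threshold:
            hits.append((row["res_id"], row["residue"], score))
    return hits


# Made-up rows for illustration only; these are not real DiscoTope-3.0 predictions.
example = """pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
demo_A,1,G,0.95,0.80,100,3,0
demo_A,2,S,0.42,0.35,100,3,0
demo_A,3,K,0.91,0.60,100,3,0
"""

print(residues_above_threshold(example))
```

To use it on a real output file, read the file contents first (for example with `open(path).read()`) and pass the text to the function.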

The PDB output files contain individual single chains with the B-factor column replaced with per-residue DiscoTope-3.0 scores (2nd right-most column). Note that the scores are multiplied by 100 as PDB files only allow 2 decimals of precision.

Example input PDB (see [7c4s.pdb](./data/example_pdbs_solved/7c4s.pdb)):
```bash
python src/predict_webserver.py \
--pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb \
--struc_type solved \
--out_dir output/7c4s
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```

Example output CSV (see [7c4s_A_discotope3.csv](./output/7c4s/output/7c4s_A_discotope3.csv)):
@@ -197,22 +161,46 @@ pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
```

Example output PDB (see [7c4s_A_discotope3.pdb](./output/7c4s/output/7c4s_A_discotope3.pdb)):
(Note DiscoTope-3.0 scores in the B-factor column)
```text
ATOM 1 N GLY A 14 -16.773 -32.069 23.105 1.00 15.19 N
ATOM 2 CA GLY A 14 -15.595 -32.029 23.955 1.00 15.19 C
ATOM 3 C GLY A 14 -14.287 -31.844 23.204 1.00 15.19 C
ATOM 4 O GLY A 14 -13.284 -32.465 23.555 1.00 15.19 O
```
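
The scores can be read back from an output PDB like the one above by parsing the B-factor field and dividing by 100. The following is a minimal sketch, not part of DiscoTope-3.0: it assumes standard fixed-width PDB `ATOM` records (B-factor in columns 61-66) and takes the value on each residue's CA atom as the per-residue score. The helper `atom_line` and the function name `scores_from_pdb` are hypothetical and exist only to make the example self-contained.

```python
def atom_line(serial, name, resname, chain, resseq, xyz, occupancy, bfactor):
    """Hypothetical helper: build one fixed-width PDB ATOM record for the demo."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}"
    )


def scores_from_pdb(pdb_text):
    """Map (chain, residue number) -> score, read from the CA atom's B-factor."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain, 23-26 the residue number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            # The B-factor field stores score * 100, so divide to get back to 0-1.
            scores[key] = round(float(line[60:66]) / 100, 4)
    return scores


# One record using the values of the CA atom from the example output above.
demo = atom_line(2, "CA", "GLY", "A", 14, (-15.595, -32.029, 23.955), 1.00, 15.19)
print(scores_from_pdb(demo))
```

Fixed-column slicing is used rather than splitting on whitespace because adjacent PDB fields can run together without a separating space.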

## Reproduce test-set predictions (AlphaFold2 structures)

```bash
# Unzip AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on PDB folder
python discotope3/main.py \
--pdb_dir data/test_set_af2 \
--struc_type alphafold \
--out_dir output/test_set_af2
```

# Common issues

- No valid amino-acid backbone found: Occurs if only heteroatoms (non-amino acid residues) are found in the extracted chain. DiscoTope-3.0 requires full amino-acid backbone C, Ca and N atoms.
- PDBConstructionWarning regarding discontinuous chains: Indicates missing residue atoms in the input PDB file. May impact DiscoTope-3.0 performance (solved structures only)
- Biopython future deprecation warning: Benign Biopython library warning, does not impact predictions
- ESM regression weights missing warning: Benign fair-esm library warning, does not impact predictions
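
To pre-check an input file for the first issue above, one can look for chains that contain at least one residue with N, CA and C atoms in its `ATOM` records. This is a hedged sketch under the assumption of standard fixed-width PDB records; the README does not describe how DiscoTope-3.0 itself performs this check, and `atom_record` and `chains_with_backbone` are hypothetical names used only here.

```python
BACKBONE = {"N", "CA", "C"}


def atom_record(record, serial, name, resname, chain, resseq):
    """Hypothetical helper: build a fixed-width PDB coordinate record for the demo."""
    return (
        f"{record:<6s}{serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


def chains_with_backbone(pdb_text):
    """Return chain IDs having at least one residue with N, CA and C ATOM records."""
    atoms = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # HETATM and all other record types are ignored
        residue = (line[21], line[22:27])  # chain ID, residue number + insertion code
        atoms.setdefault(residue, set()).add(line[12:16].strip())
    return {chain for (chain, _), names in atoms.items() if BACKBONE <= names}


# Chain A has a full backbone; chain B contains only a heteroatom (water).
demo = "\n".join([
    atom_record("ATOM", 1, "N", "GLY", "A", 14),
    atom_record("ATOM", 2, "CA", "GLY", "A", 14),
    atom_record("ATOM", 3, "C", "GLY", "A", 14),
    atom_record("HETATM", 4, "O", "HOH", "B", 1),
])
print(chains_with_backbone(demo))
```

A chain missing from the returned set would be a candidate for the "No valid amino-acid backbone found" message.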

# Note on reproducibility
Output is deterministic, i.e. the same machine will always produce the same output. However, when comparing against results produced with an older CUDA version or a different GPU, minor discrepancies in DiscoTope-3.0 scores may occur from the 4th significant figure, e.g. 0.27130 -> 0.27125. These differences are due to inherent variability in floating point computations, arising especially from changes in algorithms / optimizations across CUDA toolkit versions.
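
When comparing scores produced on different machines, a tolerance-based comparison avoids flagging such small differences. The sketch below is illustrative only: the function name and the absolute tolerance of `1e-3` are assumptions chosen to be looser than the size of the discrepancy in the example above, not values recommended by the authors.

```python
import math


def scores_match(scores_a, scores_b, abs_tol=1e-3):
    """True if both score lists have equal length and agree element-wise within abs_tol."""
    return len(scores_a) == len(scores_b) and all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for a, b in zip(scores_a, scores_b)
    )


# The pair of values from the example above differs by 0.00005.
print(scores_match([0.27130], [0.27125]))
```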
## Installation gcc or g++ errors, missing torch-scatter build ...
```bash
# Make sure gcc and g++ versions are updated, pybind11 is available
# torch-scatter should be listed with 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get
sudo apt-get install gcc g++
pip install pybind11
```

# Citation
For usage of the package and associated manuscript, please cite according to the enclosed [citation.bib](./citation.bib).
Empty file added discotope3/__init__.py
Empty file.
7 changes: 7 additions & 0 deletions discotope3/esm/__init__.py
@@ -0,0 +1,7 @@
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from . import pretrained # noqa
from .data import Alphabet
10 changes: 10 additions & 0 deletions discotope3/esm/constants.py
@@ -0,0 +1,10 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

# fmt: off
proteinseq_toks = {
'toks': ['L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X', 'B', 'U', 'Z', 'O', '.', '-']
}
# fmt: on