This repository contains the XPSI framework: the X-ray Free Electron Laser (XFEL) Based Protein Structure Identifier. The framework combines deep learning and traditional machine learning to identify three structural properties (i.e., orientation, conformation, and protein type) from the diffraction patterns of a given protein.
Motivation: Proteins and other biological molecules are responsible for many vital cellular functions, and a protein's structure determines its functionality. Identifying a protein's structure helps us understand its functional mechanisms, which in turn can help solve difficult problems such as determining the causes of diseases and designing drugs.
Diffraction patterns: Diffraction patterns are images generated by applying an X-ray Free Electron Laser (XFEL) beam to proteins. These diffraction patterns (images) can reveal the inner structure of a protein. Specifically, three properties can be embedded in an image: the orientation of a protein conformation, the conformation of a folded protein, and the type of protein.
Framework overview: The input data are diffraction patterns, i.e., images that embed the structure of proteins. The diffraction patterns are generated by simulations or experiments using an X-ray Free Electron Laser (XFEL) beam; the higher the beam intensity, the higher the resolution and precision of the diffraction patterns. The patterns are processed by an autoencoder that captures the key information and produces a tensor representation of each pattern. The autoencoder consists of an encoder and a decoder: the encoder has three convolutional and downsampling layers, and the decoder mirrors the structure of the encoder. The resulting latent space is used to train and validate traditional machine learning models such as k-nearest neighbors (kNN). We use a kNN angle regressor to predict the orientation and a kNN classifier to predict the different protein conformations.
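As a rough illustration of this pipeline, the sketch below builds a small convolutional autoencoder with TensorFlow/Keras and trains scikit-learn kNN models on its latent representation. The image size, layer widths, variable names, and the 4-D orientation target are illustrative assumptions, not the exact architecture used in xpsi_research.ipynb.

```python
# Minimal sketch of an XPSI-style pipeline (shapes, layer widths, and names are
# illustrative assumptions, not the exact notebook architecture).
import numpy as np
from tensorflow.keras import layers, models
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Dummy stand-ins for diffraction patterns and their labels (assumed shapes).
X = np.random.rand(32, 64, 64, 1).astype("float32")    # diffraction patterns
y_angle = np.random.rand(32, 4).astype("float32")      # orientation target (assumed 4-D)
y_conf = np.random.randint(0, 2, size=32)               # conformation labels

# Encoder: three convolution + downsampling stages.
inp = layers.Input(shape=(64, 64, 1))
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
latent = layers.MaxPooling2D(2)(x)                      # tensor representation

# Decoder: mirror of the encoder, reconstructing the input pattern.
x = layers.Conv2D(64, 3, activation="relu", padding="same")(latent)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inp, out)
encoder = models.Model(inp, latent)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=1, batch_size=8, verbose=0)  # learn to reconstruct patterns

# Flatten the latent tensors and train the downstream kNN models on them.
Z = encoder.predict(X).reshape(len(X), -1)
knn_angle = KNeighborsRegressor(n_neighbors=5).fit(Z, y_angle)  # orientation regressor
knn_conf = KNeighborsClassifier(n_neighbors=5).fit(Z, y_conf)   # conformation classifier
print(knn_angle.predict(Z[:2]), knn_conf.predict(Z[:2]))
```

The key design point, as described above, is that the autoencoder compresses each pattern into a latent tensor, and the lightweight kNN models operate on that compressed representation rather than on the raw images.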
To run the XPSI framework, open the Jupyter notebook xpsi_research.ipynb.
The software stack required to run the XPSI framework can be installed using either Anaconda or pip. Each of these options will install the required dependencies.
Dependencies:
- Python=3.7.7
- numpy
- pandas
- scipy
- pillow
- scikit-learn=0.23.1
- seaborn
- matplotlib
- configparser
- tensorflow=2.0.0
- jupyter
- ipyfilechooser
Moreover, wget is required to download the data.
If you do not have Anaconda installed, you can follow the instructions here to install it.
Make sure to change the prefix in install/env_conda.yml to the location of Anaconda on your local machine (e.g., /opt/anaconda3/ or /home/opt/anaconda3/).
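For example, if Anaconda is installed under /opt/anaconda3 (an illustrative path; use the path of your own installation), the prefix line at the end of install/env_conda.yml would read:
prefix: /opt/anaconda3/envs/xpsi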
Run the next commands on your local machine:
conda env create -f install/env_conda.yml
conda activate xpsi
Once your environment is installed, you can launch jupyter notebook and run xpsi_research.ipynb.
When using Conda on Power9 processors, such as on the Summit or Tellico clusters, the conda channel that provides TensorFlow for Power9 needs to be registered. To do so, run the following commands on your Power9 cluster:
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
conda env create -f install/env_conda_power9.yml
conda activate xpsi_p9
When using Conda on Jetstream, start by launching the web shell desktop of your virtual machine instance. Then, open the terminal in the web shell desktop and enter the ezj command. Since Anaconda is a pre-installed package on Jetstream, running this command will display the Anaconda installation directory, for example: Anaconda installed to /opt/anaconda3.
Then, open the <project_directory>/install/env_conda.yml file and change the prefix: entry to the Anaconda location that was displayed by ezj. Using the above example, you would change the prefix to the following:
prefix: /opt/anaconda3/envs/xpsi
Now, you can create a conda environment by running the following commands:
conda env create -f install/env_conda.yml
conda activate xpsi
Once your environment is installed, you can launch jupyter notebook and run xpsi_research.ipynb.
You are required to have Python 3.7.7 installed; pip is installed with it automatically. If you do not have pip installed, follow the instructions here to install it.
Run the next commands on your local machine:
python -m pip install -r install/env_pip.txt
Once your environment is installed, you can launch jupyter notebook and run xpsi_research.ipynb.
There are different options for launching a Jupyter notebook from an HPC cluster. Here we provide two options, but this step will depend on the specifications of your cluster.
Below is an example protocol to run a Jupyter notebook on an interactive node of a high-performance computer (HPC). The instructions have been adapted from this webpage. First, you need to request an interactive node. We show an example for two schedulers: SLURM and LSF.
## For SLURM
srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
## For LSF
# For CPU only
bsub -n 1 -Is bash
# If your node has a GPU
bsub -gpu "num=1:mode=exclusive_process" -Is bash
From within the interactive node, you need to activate the conda environment:
cd $HOME
conda activate xpsi
Run Jupyter on the claimed interactive node. Check the node name.
jupyter notebook --no-browser --ip='0.0.0.0'
On your local terminal, start another SSH session with tunnelling, using the interactive node name as noted above:
ssh user@host -L8888:nodeName:8888 -N
Copy the URL that the Jupyter daemon generated in the previous step and paste it into the browser on your computer. The URL should look similar to https://(nodeName or 127.0.0.1):8888/?token=3f7c3a8949b3fa1961c63653873fea075a93a29bffe373b5. Choose either nodeName or 127.0.0.1 in the URL.
Supercomputers like Summit from ORNL include an OLCF JupyterHub. The OLCF JupyterHub implementation will spawn you into a single-user JupyterLab environment. These are the instructions to access the OLCF JupyterHub. Contact your computing facility support for help running Jupyter on other supercomputers.
Copyright (c) 2022, Global Computing Lab
XPSI is distributed under the terms of the Apache License, Version 2.0 with LLVM Exceptions.
See LICENSE for more details.