Predict protein interactions from raw PDB (Protein Data Bank) files. This diagram illustrates the different functionalities of this software tool and how its parts relate to each other.
A script that tries to install all necessary dependencies can be found in setUp/setUp.sh. Install all dependencies by typing:
cd setUp
./setUp.sh
To view this documentation on your local machine as it is displayed on GitHub, type:
./docs/createDoc.sh
This requires mkdocs to be installed.
To test the installation, run the script TestInstallation.R, which can be found in /QuickStart/SmallExample/. Execute it with:
./TestInstallation.R
This script performs the steps described in the next section, QuickStart. If an error occurs, the script aborts with an error message.
In /QuickStart/SmallExample/ a small example can be found with which the functionality and the correct installation can be tested. For details on specific features, refer to the corresponding sections of this documentation.
Adjust the parameters on your machine in the following call:
<pathToPredictingProteinInteractions>/PredictingProteinInteractions/Classification/./predictingProteinInteractions.R --mode SingleDistance --doClustering TRUE --pdb_folder <pathToPredictingProteinInteractions>/PredictingProteinInteractions/QuickStart/SmallExample/pdb_train/ --distances_train <pathToPredictingProteinInteractions>/PredictingProteinInteractions/QuickStart/SmallExample/Train/UltraQuickRepSub/ --numberOfPoints 4 --rounds 10 --MutCompOutPut <pathToPredictingProteinInteractions>/PredictingProteinInteractions/QuickStart/SmallExample/Train/ --doMutComp TRUE --q_val 1 --labels_train <pathToPredictingProteinInteractions>/PredictingProteinInteractions/data/labels106.txt
where pathToPredictingProteinInteractions is the path to the implementation on your machine.
This invokes the preprocessing with MutComp and VMD. Then UltraQuickRepeatedSubSampling is called and a dendrogram is produced. If everything goes well, the dendrograms can be found in /QuickStart/SmallExample/Dendrogramms/ and should look similar to this one:
To train a neural net, type:
<pathToPredictingProteinInteractions>/PredictingProteinInteractions/MetricGeometry/QuickRepeatedSubSampling/./Proteins.R --pathToExperiment <pathToPredictingProteinInteractions>/PredictingProteinInteractions/QuickStart/SmallExample/Train/Output/ --mode evaluation --outPutFolder Test1 --useSmallExample
In /QuickStart/SmallExample/Train/NNexperimentsKfoldCV/Test1/ a file nnModel.h5 can then be found, which can be used to make predictions on new data. The flag useSmallExample adjusts the parameters so that the training time is reduced. However, this does not lead to meaningful results, so do not use this flag when you want to train a real model.
To make predictions for new data, you first need to train a neural net as described in the previous section. The resulting neural net can then be used to make predictions on a new data set.
The folder /QuickStart/SmallExample/pdb_predict contains pdb-files for which predictions should be made. First, preprocess the pdb-files with predictingProteinInteractions.R as follows:
<pathToPredictingProteinInteractions>/Classification/./predictingProteinInteractions.R --mode SingleDistance --doClustering FALSE --pdb_folder <pathToPredictingProteinInteractions>/QuickStart/SmallExample/pdb_predict/ --MutCompOutPut <pathToPredictingProteinInteractions>/QuickStart/SmallExample/Predict/ --doMutComp TRUE --q_val 1
Then, using the model that was trained in the previous section and placed in /QuickStart/SmallExample/Train/NNexperimentsKfoldCV/Test1/, make predictions on the new data:
../../MetricGeometry/QuickRepeatedSubSampling/./Proteins.R --outPutFolder TestPred --pathToExperiment <pathToPredictingProteinInteractions>/QuickStart/SmallExample/Predict/Output/ --mode prediction --nnModelFolder <pathToPredictingProteinInteractions>/QuickStart/SmallExample/Train/NNexperimentsKfoldCV/Test1/
The results can be found in /QuickStart/SmallExample/Predict/NNexperimentsKfoldCV/TestPred/ and may look similar to this (predictions.txt):
name | functional | not_functional |
---|---|---|
000_Grx4_1 | 0.50 | 0.50 |
013 | 0.25 | 0.75 |
085 | 1.00 | 0.00 |
pdb_folder ... pdb-files to be processed
MutCompOutPut ... output of MutComp (dx,pts, ...)
MutCompParametersFile ... full path to the parameters-file
doMutComp ... compute MutComp
doClustering ... compute a clustering
subNamesFile ... file with a subset of the names available in the output-folder. The
clustering is then done only with these names.
mode(Train/Predict/SingleDistance) ... either train a model or make predictions
distances_train ... folder in which all distance-matrices will be stored
numberOfPoints ... number of points to select (see parameters of RepeatedSampling)
rounds ... number of repetitions
q ... number that indicates how fine the approximation should be
(higher => more accurate, but slower)
distances_train ... folder with all available distances
MutCompOutPut ... folder with the proteins, that is,
the output of MutComp (dx- and pts-files are necessary)
labels_train ... a file with "names" and "labels" specifying the functions of the proteins.
The model will be built only on the proteins that are mentioned in this file.
That means the names must occur in the column names and row names of the distances
in the folder "Distances".
For a detailed description refer to: (MutComp).
The standard path to the parameters-file (for use on the WS) is:
/home/sysgen/Documents/LWB/TCL/MutComp/GUI/Parameters/parameters.txt
To run the script, type:
<pathToMutComp>./process.sh
where pathToMutComp is the path to the script. If you want to use a different parameters-file, pass it as an argument to the script:
<pathToMutComp>./process.sh <path/to/a/different/parametersfile>
On my machine this call becomes:
/home/willy/PredictingProteinInteractions/PreProcessingProteins/MutComp/./process.sh /home/willy/PredictingProteinInteractions/PreProcessingProteins/MutComp/GUI/Parameters/parametersForThesis.txt
I strongly recommend placing the parametersFile for each data-set in the same folder.
For a detailed description refer to: (centerSelect).
Selects a box around the active site.
Based on the approximation of the DE multiple features are calculated. Then data-augmentation is performed. Then a neural net is trained. The neural net can then be used to predict if two given proteins react with each other.
Given the positive and negative potential of the proteins and the active center, multiple features are calculated. These features are the quantiles of the approximation of the DE (Distribution of Eccentricities). The features are then used to train a neural net.
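The following Python sketch illustrates the idea of turning a set of eccentricities into quantile features. It is not the implementation used by Proteins.R. It assumes that the eccentricity of a point is its mean Euclidean distance to all points of the model, which is one common definition and is not confirmed by this documentation. The function names are hypothetical.

```python
import math


def eccentricities(points):
    """Mean Euclidean distance from each point to all points.

    ASSUMPTION: this definition of eccentricity is chosen for illustration only.
    """
    n = len(points)
    return [sum(math.dist(p, other) for other in points) / n for p in points]


def quantile_features(values, q):
    """Return the q + 1 evenly spaced empirical quantiles (levels 0, 1/q, ..., 1).

    Linear interpolation between neighbouring order statistics is used.
    """
    xs = sorted(values)
    features = []
    for k in range(q + 1):
        pos = k / q * (len(xs) - 1)       # fractional index into the sorted values
        lo = math.floor(pos)
        hi = min(lo + 1, len(xs) - 1)
        features.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return features
```

A feature vector for one model would then be `quantile_features(eccentricities(points), q)`, where `points` is a list of 3D coordinates.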
There are two options:
- training a neural net and evaluating with k-fold-Cross-Validation. Additionally this model can be exported.
- using a pre-trained model to make predictions on new data.
If you only want to generate the features (e.g. because you want to call Clusterin.R afterwards), run with onlyGenerateModels = TRUE.
- In the protein-folder the .obj-file has to be present which is delivered by MutComp.
- If the measure should be updated with the information about the active center, then the file "active_center.pts" has to be present which is delivered by selectCenter.R.
pathToExperiment ... path to the folder that contains all the folders of all proteins. (Output-folder of MutComp).
onlyGenerateModels ... 1: only generate the features of the models. Else the features are generated and then the neural-net
is trained.
mode ... prediction/evaluation. Either make predictions on new data or perform k-fold-cv to evaluate a model.
parametersFile ... contains the parameters needed for the features and the neural net. Separated by ; the arguments
are inserted row-wise. If no parametersFile is specified then the default-parameters are used.
outPutFolder ... name of the folder in which all output is saved. The folder is placed in
<pathToExperiment>/NNexperimentsKfoldCV/. So <outPutFolder> is not a full path!
Most of the ~50 parameters are specified in an additional parameters-file. Note that when executing this script, parameters specified on the command line have higher priority and override those in the parameters-file.
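The priority rule can be sketched in Python as follows. This is only an illustration and not the code of Proteins.R. The function names are hypothetical, and the separator of the parameters-file is passed in explicitly because the sketch makes no assumption about it.

```python
def read_parameters(text, sep):
    """Parse rows of the form 'name<sep>value' into a dict of strings."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, value = line.split(sep, 1)
        params[name.strip()] = value.strip()
    return params


def merge_parameters(file_params, cli_params):
    """Values given on the command line override values from the file.

    Command-line entries that are None (not specified) are ignored.
    """
    merged = dict(file_params)
    merged.update({k: v for k, v in cli_params.items() if v is not None})
    return merged
```

For example, if the file sets `mode` to `evaluation` and the command line passes `--mode prediction`, the merged parameters contain `prediction`.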
a1,a2,a3,a4,a5 ... specifies how important the active site is. a1, b1, n1 together form the parameters for one
feature. Set to -1 if you want to use fewer features.
b1,b2,b3,b4,b5 ... specifies how important the border region is. a1, b1, n1 together form the parameters for one
feature. Set to -1 if you want to use fewer features.
mNearestNeighbors ... how many points in close proximity to the border area are considered.
n1,n2,n3,n4,n5 ... in (0,1). Specifies how local the feature is. a1, b1, n1 together form the parameters for one
feature. Set to -1 if you want to use fewer features.
recalculateModel ... specifies if the preprocessing should be recalculated. If set to 0 and the file already exists,
it is not recalculated.
recalculateQuants ... specifies if the quantiles should be recalculated. If set to 0 and the file already exists,
it is not recalculated.
n_s_euclidean ... number of points to select with the euclidean fps (farthest-point-sampling-procedure).
n_s_dijkstra ... number of points to select with the geodesic fps (farthest-point-sampling-procedure).
stitchNum ... number of points that are created from Manifold.
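To illustrate what n_s_euclidean controls, here is a minimal Python sketch of greedy farthest-point sampling with Euclidean distances. It shows the general technique only; the implementation in this tool may differ in details such as the choice of the starting point. The function name and the `start` argument are hypothetical. For the geodesic variant (n_s_dijkstra) the Euclidean distance would be replaced by shortest-path distances on the surface.

```python
import math


def farthest_point_sampling(points, n_s, start=0):
    """Greedily select n_s point indices.

    Each new point is the one farthest from all points selected so far.
    """
    if not 0 < n_s <= len(points):
        raise ValueError("n_s must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_s:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected
```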
An exemplary parameters-file is shown below. The parameters for both the features and the neural net are stored in this file. The columns are ","-separated.
parameter | value |
---|---|
a1 | 1 |
a2 | 1 |
a3 | 1 |
a4 | 1 |
a5 | 1 |
b1 | 1 |
b2 | 1 |
b3 | 1 |
b4 | 1 |
b5 | 1 |
n1 | 0.1 |
n2 | 0.2 |
n3 | 0.3 |
n4 | 0.5 |
n5 | 0.8 |
mNearestNeighbor | 10 |
pathToExperiment | /home/willy/PredictingProteinInteractions//data/106Test/Output/ |
n_s_euclidean | 1000 |
n_s_dijkstra | 1000 |
stitchNum | 2000 |
recalculateModel | 0 |
recalculateQuants | 0 |
sampleSize | 20 |
l1 | 100 |
l2 | 100 |
l3 | 100 |
l4 | 50 |
l5 | 30 |
d1 | 0.2 |
d2 | 0.2 |
d3 | 0.2 |
d4 | 0.2 |
d5 | 0.2 |
epochs | 20 |
batchSize | 16 |
experimentName | Dummy |
recalculateNN | 0 |
kFolds | 10 |
numPermutations | 400 |
numCores | 4 |
mode | evaluation |
For each fold three files are created: accuracy.txt, f1_score.txt and confMat.txt, storing the accuracy, the F1-score and the confusion-matrix. Additionally, three such files with tex-extensions are created, which are obtained by averaging over the folds.
If mode is set to "evaluation" then the necessary features are generated as specified in parametersFile. Then a neural net is trained with k-fold cross-validation. For each fold the accuracy, F1-score and confusion-matrix are saved. In case of k == 1 the final trained model is additionally exported. This model can then be used for predictions on new data.
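The splitting step of k-fold cross-validation can be sketched in Python as below. This is a generic illustration, not the code used by Proteins.R. The function name and the `seed` argument are hypothetical, and the handling of k == 1 (training and evaluating on the same data) is an assumption chosen to match the behaviour described in this section.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for each of k folds."""
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized parts
    for i, test in enumerate(folds):
        if k == 1:
            # ASSUMPTION: a single fold means train and test sets coincide
            yield list(test), list(test)
        else:
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, list(test)
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every sample once for evaluation.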
An exemplary call looks like this:
./Proteins.R --pathToExperiment /home/willy/PredictingProteinInteractions/data/106Test/Output/ --mode evaluation --outPutFolder Dummy2
The output is a confusion-matrix (here with k == 1) and might look like this:
| | functional | not_functional |
|---|---|---|
| functional | 0.90000000 | 0.02865169 |
| not_functional | 0.10000000 | 0.97134831 |
Keep in mind that with a lower k, that is, a lower number of folds, the evaluation becomes less reliable as an estimate of the performance on new, unseen data. With k == 1 the evaluation is only a training error and hence does not show how well the model will perform on unseen data.
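In the example matrix the entries of each column sum to 1, which suggests a column-normalised confusion-matrix. The Python sketch below shows how such a matrix could be computed. It assumes that rows are predicted classes and columns are true classes; this orientation is not stated in this documentation, and the function name is hypothetical.

```python
def column_normalised_confusion(true_labels, predicted_labels, classes):
    """Return matrix[predicted][true] where each 'true' column sums to 1.

    ASSUMPTION: rows = predicted class, columns = true class.
    """
    counts = {p: {t: 0 for t in classes} for p in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[p][t] += 1
    totals = {t: sum(counts[p][t] for p in classes) for t in classes}
    return {
        p: {t: (counts[p][t] / totals[t] if totals[t] else 0.0) for t in classes}
        for p in classes
    }
```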
If mode is set to "prediction" then a previously trained neural net is read in and a folder containing new data has to be specified. Predictions are then made with the neural net for the proteins in that folder.
An exemplary call looks like this:
./Proteins.R --outPutFolder Dummy5 --pathToExperiment /home/willy/PredictingProteinInteractions/data/NewPredTest/Output/ --mode prediction --nnModelFolder /home/willy/PredictingProteinInteractions/data/106Test/NNexperimentsKfoldCV/Dummy2/
The above call creates a new folder /home/willy/PredictingProteinInteractions/data/NewPredTest/Output/NNexperimentsKfoldCV/Dummy5/ in which all data will be stored. Predictions will be made for all proteins that can be found in the folder /home/willy/PredictingProteinInteractions/data/NewPredTest/Output/. The previously trained neural net that will be used for prediction is located in /home/willy/PredictingProteinInteractions/data/106Test/NNexperimentsKfoldCV/Dummy2/.
The predictions are then stored in a file predictions.txt. Each row corresponds to one protein, and the columns contain the predicted probabilities for each class.
name | functional | not_functional |
---|---|---|
1aba | 0.2 | 0.8 |
5e37 | 0.1 | 0.9 |
5j3r | 0 | 1 |
5jy5 | 0.01 | 0.99 |
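To pick the most probable class for each protein, a table like the one above can be post-processed as in the following Python sketch. It assumes the rows are pipe-separated exactly as displayed here; the actual layout of predictions.txt on disk may differ, so the parser is an illustration only. Both function names are hypothetical.

```python
def parse_predictions(text):
    """Parse rows like 'name | functional | not_functional |' into a dict.

    ASSUMPTION: pipe-separated layout as displayed in this documentation.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip()
    ]
    header, body = rows[0], rows[1:]
    predictions = {}
    for row in body:
        if set(row[0]) <= set("-: "):
            continue  # skip the '---|---|---|' separator row
        predictions[row[0]] = {
            cls: float(value) for cls, value in zip(header[1:], row[1:])
        }
    return predictions


def most_likely_class(probabilities):
    """Return the class with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)
```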
Create dendrograms like this one from the 106-Redoxin-data-set:
Calculate all pairwise distances between the proteins. From this distance-matrix build an agglomerative bottom-up clustering.
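A bottom-up clustering on a distance-matrix can be sketched in Python as follows. The linkage criterion used by this tool is not stated here, so the sketch assumes average linkage; the function name is hypothetical. It returns the merge history from which a dendrogram could be drawn.

```python
def agglomerative_clustering(dist, names):
    """Naive average-linkage clustering.

    dist is a symmetric matrix (list of lists); returns a list of merges
    (members_a, members_b, distance) in the order they happen.
    ASSUMPTION: average linkage, chosen for illustration.
    """
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for ai in range(len(keys)):
            for bi in range(ai + 1, len(keys)):
                a, b = keys[ai], keys[bi]
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]], d))
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
    return merges
```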
Calculate the repeatedSubSampling fast. For each protein the points are sampled m times, and this is done only once per protein. Then the distributions are calculated. From these distributions the quantile-approximation is calculated. The DE is then calculated with the Manhattan distance between the quantiles.
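The last step, comparing proteins by the Manhattan distance between their quantile vectors, can be sketched in Python as below. The quantile vectors are taken as given; the function names and the example values are hypothetical and serve only as an illustration.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def pairwise_distance_matrix(quantiles):
    """quantiles maps a protein name to its quantile vector.

    Returns the sorted names and the symmetric matrix of Manhattan distances.
    """
    names = sorted(quantiles)
    matrix = [
        [manhattan_distance(quantiles[a], quantiles[b]) for b in names]
        for a in names
    ]
    return names, matrix
```

For two proteins with quantile vectors `[0.0, 1.0, 2.0]` and `[0.5, 1.0, 3.0]` the distance is `0.5 + 0.0 + 1.0 = 1.5`.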
ProteinsPath ... path to all proteins as produced by MutComp
distance_name ... folder in which all distance-matrices will be stored
n ... number of points to select (see parameters of RepeatedSampling)
m ... sqrt(number of repetitions)
q ... number of subdivisions of the integral. Basically a higher q
leads to a more accurate approximation. Currently q < n
has to hold.
potential ... pos/neg
distance_method ... geo/emd
plot ... in case of (q == 2) the approximations are plotted into the 2d-plane.
Otherwise no plot is created and a warning message is issued.
labels ... a file containing for each protein a label functional/not_functional
cores ... number of cores to run on. Tries to detect the number of cores
automatically. If this fails the number of cores is set to 6
by default.
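The fallback behaviour of the cores parameter corresponds to the following pattern, shown here as a Python sketch rather than the tool's own code. The function name and the `detector` argument are hypothetical; `detector` exists only so that the fallback can be exercised.

```python
import os


def detect_cores(default=6, detector=os.cpu_count):
    """Return the detected number of cores, or `default` if detection fails.

    os.cpu_count() returns None when the number of cores cannot be determined.
    """
    detected = detector()
    return detected if detected else default
```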
Train an auto-encoder that reduces the number of features of the proteins. Then the bottle-neck-layer is extracted and used to extract a condensed representation of the proteins. With these condensed representations then an agglomerative, bottom-up clustering is performed. With this clustering then a dendrogram is obtained.
inputPath ... path to a file with a .Rdata-extension that stores the feature-matrix for
all proteins. Can be obtained by running Proteins.R.
dendrogramName ... name of the dendrogram-file. Without the (.pdf)-extension and prefix Dendrogram.
numPermutations ... number of permutations that are created for each representation. For
each protein m rows from the feature-matrix are combined. The order of
these m rows is randomly permuted and numPermutations different
representations are created.
m ... number of rows from the feature-matrix that are combined for each protein-model.
epochs ... number of epochs to train the autoencoder.
batchSize ... size of a batch. Relevant for the training.
l1,l2,l3 ... specifies the encoder-dimensions, that is the size of each layer in the network.
The autoencoder in this implementation has 3 layers.
d1,d2,d3 ... in [0,1) specifies the dropout-rates.
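The role of m and numPermutations can be illustrated with the following Python sketch. It is an interpretation of the parameter descriptions above and not the tool's implementation: it assumes that the m rows are drawn once per protein and that only their order changes between representations. The function name and the `seed` argument are hypothetical. Note that random shuffles can repeat, so the representations are not guaranteed to be distinct.

```python
import random


def permuted_representations(feature_rows, m, num_permutations, seed=0):
    """Combine m rows of one protein into num_permutations flat vectors.

    ASSUMPTION: the m rows are drawn once; each representation concatenates
    them in a freshly shuffled order.
    """
    if m > len(feature_rows):
        raise ValueError("m must not exceed the number of available rows")
    rng = random.Random(seed)
    chosen = rng.sample(feature_rows, m)
    representations = []
    for _ in range(num_permutations):
        order = chosen[:]
        rng.shuffle(order)
        representations.append([x for row in order for x in row])
    return representations
```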