Requirements
- Python 3
- NumPy
- pandas
- Perl
- scikit-learn (sklearn)
- xgboost
- bedtools
- keras
- tensorflow
- gcc
- Download the hg19.fa file from UCSC and place it in the "Model_building" directory under the name "hg19.fa".
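A quick pre-flight check can confirm that the requirements above are in place before anything is run. The sketch below is only an illustration and is not part of RBPSpot: the helper name `find_missing` is hypothetical, the Python import names (for example `sklearn`) are assumed to match the packages listed above, and the genome path is assumed to be relative to the repository root.

```python
import importlib.util
import shutil
from pathlib import Path

# Import names assumed for the Python packages listed above.
PYTHON_MODULES = ["numpy", "pandas", "sklearn", "xgboost", "keras", "tensorflow"]
# Command-line tools listed above.
TOOLS = ["perl", "bedtools", "gcc"]


def find_missing(modules=PYTHON_MODULES, tools=TOOLS, genome="Model_building/hg19.fa"):
    """Return the requirements that could not be found (hypothetical helper)."""
    return {
        # find_spec locates a top-level module without importing it.
        "modules": [m for m in modules if importlib.util.find_spec(m) is None],
        # shutil.which looks the executable up on PATH.
        "tools": [t for t in tools if shutil.which(t) is None],
        "genome": [] if Path(genome).is_file() else [genome],
    }


if __name__ == "__main__":
    for kind, names in find_missing().items():
        print(kind, "missing:", ", ".join(names) if names else "none")
```

An empty list under every key suggests the environment is ready; it does not verify package versions.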
Running script [model building]
This script contains two modules:
- For model building (this module requires CLIP-seq peak data in BED format)
To run the script:
./RBPSpot_model <bed_file> <window_size>
E.g. ./RBPSpot_model Example.bed 17
To build the model with XGBoost:
python3 xgb_model.py Example.bed_train Example.bed_test
The BED file contains the CLIP-seq peak data. It can have any name, but that name is used as the prefix for every file generated in this step, and some of those files are needed in the next step.
The window size can vary between 17 and 131. For the best result, try several window sizes.
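Trying several window sizes can be scripted. The following is a minimal sketch, assuming the 17 to 131 range is inclusive and that `./RBPSpot_model` is run from the directory that contains it; `model_command` and `build_models` are hypothetical helper names, not part of RBPSpot.

```python
import subprocess

MIN_WINDOW, MAX_WINDOW = 17, 131  # range stated above, assumed inclusive


def model_command(bed_file, window_size):
    """Build the ./RBPSpot_model command line after checking the window size."""
    if not MIN_WINDOW <= window_size <= MAX_WINDOW:
        raise ValueError(
            f"window_size must be between {MIN_WINDOW} and {MAX_WINDOW}, got {window_size}"
        )
    return ["./RBPSpot_model", bed_file, str(window_size)]


def build_models(bed_file, window_sizes):
    """Run model building once per window size (hypothetical helper)."""
    for window_size in window_sizes:
        # check=True stops the loop if a run exits with an error.
        subprocess.run(model_command(bed_file, window_size), check=True)
```

For example, `model_command("Example.bed", 17)` yields the same command as the example above. Since output files take the BED file name as their prefix, successive runs on the same file may overwrite each other, so it is probably safest to copy the outputs aside between runs.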
Output description [model building]
bedfile_model
E.g. Example.bed_model
Folder containing the model file in .pb format along with its assets and variables.
Running script [Scanning module]
- Scanning sequences with a built model requires five files in the Model_Scan folder:
  (i) Input sequences
  (ii) bedfile_model folder [generated by the model building process]
  (iii) bedfile_penta_Prob_value [generated by the model building process]
  (iv) bedfile_penta_Prob_value [generated by the model building process]
  (v) bedfile_primary_motif [generated by the model building process]
To run the script:
./scan <bed_file> <Input_sequence> <window_size>
To run in parallel: sh parallel.sh <#Processors> <Input_sequence> <bed_file> <window_size>
E.g. Move the "Example.bed_model" directory into the Model_Scan directory along with the Example.bed_penta_Prob_value, Example.bed_penta_Prob_value and Example.bed_primary_motif files. Then run: ./scan Example.bed Example_sequence.fa 17
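Before scanning, it can help to confirm that the files are where `./scan` expects them. This is a minimal sketch under the assumption that the file names follow the `<bed_file>_...` pattern shown in the example; it checks only the names that appear in this README, and `missing_scan_inputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_scan_inputs(bed_name, sequence_file, scan_dir="Model_Scan"):
    """List the expected scanning inputs that are absent from scan_dir."""
    scan_dir = Path(scan_dir)
    expected = [
        scan_dir / f"{bed_name}_model",             # folder from model building
        scan_dir / f"{bed_name}_penta_Prob_value",  # file from model building
        scan_dir / f"{bed_name}_primary_motif",     # file from model building
        scan_dir / sequence_file,                   # input sequences
    ]
    return [str(path) for path in expected if not path.exists()]
```

Calling `missing_scan_inputs("Example.bed", "Example_sequence.fa")` returns an empty list when everything named here is present.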
The bed_file name must be the same name used in the Model_building step. The Input_sequence file must be in single-line FASTA format, and each sequence must be at least 160 bases long. window_size must be the same number used in the Model_building step: a different window size produces a feature vector of a different length, which the model cannot score.
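The input constraints above can be checked ahead of a scan. The sketch below assumes that "single line fasta" means each `>` header is followed by exactly one sequence line; `check_single_line_fasta` is a hypothetical helper, not part of RBPSpot.

```python
MIN_LENGTH = 160  # minimum sequence length stated above


def check_single_line_fasta(path, min_length=MIN_LENGTH):
    """Return a list of problems; an empty list means the file looks usable."""
    records = []  # each entry: (sequence name, list of sequence lines)
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                records.append((line[1:], []))
            elif records:
                records[-1][1].append(line)
            else:
                return ["sequence data found before the first '>' header"]

    problems = []
    for name, seq_lines in records:
        if len(seq_lines) != 1:
            problems.append(f"{name}: sequence spans {len(seq_lines)} lines, expected 1")
        length = sum(len(s) for s in seq_lines)
        if length < min_length:
            problems.append(f"{name}: length {length} is below {min_length}")
    return problems
```

An empty return value means every record sits on a single line and meets the length requirement; it does not validate the characters in the sequence.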
Output description [Scanning module]
Input_sequence_output.tsv: file containing 3 columns: Sequence_name Start_coordinate End_coordinate
E.g. Example_sequence.fa_output.tsv
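The output can be loaded for downstream analysis with a few lines of Python. This sketch assumes the file is tab-delimited and that the coordinates are non-negative integers; whether the file carries a header row is not stated here, so rows with non-numeric coordinates are skipped. `read_scan_output` is a hypothetical helper.

```python
import csv


def read_scan_output(path):
    """Read the scan output into a list of (name, start, end) tuples."""
    sites = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 3:
                continue  # skip blank or malformed lines
            name, start, end = row
            if not (start.isdigit() and end.isdigit()):
                continue  # skip a header row, if the file has one
            sites.append((name, int(start), int(end)))
    return sites
```

For example, `read_scan_output("Example_sequence.fa_output.tsv")` returns one tuple per reported site.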
Web-server version for 131 RBPs available at:
https://scbb.ihbt.res.in/RBPSpot/
Citation: Sharma NK, Gupta S, Kumar A, Kumar P, Pradhan UK, Shankar R (2021) RBPSpot: Learning on appropriate contextual information for RBP binding sites discovery. iScience 24(12). https://doi.org/10.1016/j.isci.2021.103381.