Yet another bioinformatics tool to detect de novo splice junctions from paired-end RNA-seq reads (human genome only)

lifengtian/SplicePL

LIMITATIONS
     SplicePL does NOT currently use non-uniquely mapped reads in the splice junction analysis. Because we do not
     know how to resolve these ambiguous reads reliably, we exclude them.

     SplicePL currently works only with the human genome. The chromosome names are hard-coded (chr1 to chr22,
     chrX, chrY, and chrM). You are encouraged to modify the source code to make it work for your genome of
     interest.


INTRODUCTION

	SplicePL is a simple BLAT parser and splice site filter, originally hacked in one day to test the idea of 
        using BLAT as an aligner. After it showed us promising results, it was rewritten in Perl5. 

        We find it a useful tool to detect de novo splice junctions from paired-end human RNA-seq 
        data, especially for reads longer than 75 nt. With these long reads, novel exon-exon junctions can be directly 
        identified by aligning reads to the reference genome using algorithms such as BLAT (Kent, 2002) or SSAHA2 (Ning 
        et al., 2001), rather than splitting reads into segments as in TopHat (Trapnell et al., 2009) and SpliceMap (Au 
        et al., 2010).

	SplicePL detects splice junctions in three stages. First, each read is aligned to the reference genome 
        using BLAT. Next, spliced reads that span an exon-exon junction are identified. Finally, the putative 
        junctions are filtered based on the splice donor/acceptor sites, the intron length, and the flanking 
        sequence length.
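
The filtering stage can be illustrated with a minimal Python sketch. This is not SplicePL's Perl code. It assumes
that each spliced alignment is described by PSL-style block sizes and forward-strand target starts (0-based), that
only canonical GT-AG introns are accepted (seen as CT-AC when the gene is on the minus strand), and that the flank
is the length of the alignment block adjacent to the junction. The thresholds and function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical thresholds; SplicePL's actual defaults are not documented here.
MIN_INTRON = 50
MAX_INTRON = 500_000
MIN_FLANK = 10  # assumed to correspond to --flanksize=10 in the usage examples


def candidate_junctions(block_sizes: List[int], t_starts: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (left_flank, intron_start, intron_end, right_flank) for every gap
    between adjacent alignment blocks. Coordinates are 0-based, half-open."""
    out = []
    for i in range(len(block_sizes) - 1):
        intron_start = t_starts[i] + block_sizes[i]  # first base after block i
        intron_end = t_starts[i + 1]                 # first base of block i+1
        out.append((block_sizes[i], intron_start, intron_end, block_sizes[i + 1]))
    return out


def passes_filter(genome_seq: str, left_flank: int, intron_start: int,
                  intron_end: int, right_flank: int) -> bool:
    """Apply intron-length, flank-length, and splice-site dinucleotide checks."""
    intron_len = intron_end - intron_start
    if not (MIN_INTRON <= intron_len <= MAX_INTRON):
        return False
    if left_flank < MIN_FLANK or right_flank < MIN_FLANK:
        return False
    first2 = genome_seq[intron_start:intron_start + 2].upper()
    last2 = genome_seq[intron_end - 2:intron_end].upper()
    # GT..AG on the plus strand; the same intron reads CT..AC for minus-strand genes.
    return (first2, last2) in {("GT", "AG"), ("CT", "AC")}
```

In this sketch, genome_seq is the sequence of a single chromosome held in memory; a real pipeline would more likely
fetch only the few bases around each junction from the 2BIT file.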

	In summary, SplicePL is a simple BLAT parser and splice site filter. It is written in Perl and packaged as
	a Perl module to facilitate testing and installation. The code could be improved considerably by using 
        BioPerl modules; however, for the sake of simplicity and ease of use, we chose to implement it as it is.

        At the time of writing (Jun 2010), SplicePL has the highest sensitivity for long reads (at least 100 nt)
        compared with TopHat and SpliceMap. However, SpliceMap has better specificity than SplicePL. 

        In our own RNA-seq analysis, we used TopHat, SpliceMap, SplicePL, and GSNAP. We find that the methods 
        give robust, consistent results for junctions from highly expressed transcripts, whereas junctions from 
        lowly expressed transcripts tend to differ more between methods. 


PREREQUISITE
	Before running SplicePL, please make sure that:
	1. Either the BLAT or the gfServer/gfClient binaries are installed. The most recent version (Jun 25, 2010) 
	   is 34x7. You also need faToTwoBit, which is part of BLAT.
	   For BLAT installation, please go to https://genome.ucsc.edu/FAQ/FAQblat.html#blat3
	   
	2. The FASTA files for the individual human chromosomes have been downloaded and, in the same folder, a 
           2BIT format file has been prepared from chr1.fa, chr2.fa, ..., chr22.fa, chrX.fa, chrY.fa, and chrM.fa. 
           You can download the hg19 assembly (the Feb. 2009 human reference sequence, GRCh37) from 
           https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
	
	3. You know the amount of RAM and the number of processors in your LINUX server. 
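
As a rough illustration of step 2, the following Python sketch builds the faToTwoBit command line for the 25 human
chromosome files. faToTwoBit takes one or more FASTA files followed by the output file name. The helper names, the
directory, and the output file name are assumptions made for this example, not part of SplicePL.

```python
import os
from typing import List


def human_chromosomes() -> List[str]:
    """chr1..chr22, chrX, chrY, chrM: the set SplicePL hard-codes."""
    return [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def fa_to_twobit_command(fasta_dir: str, out_2bit: str) -> List[str]:
    """Build the argument list: faToTwoBit in1.fa in2.fa ... out.2bit"""
    fasta_files = [os.path.join(fasta_dir, f"{c}.fa") for c in human_chromosomes()]
    return ["faToTwoBit", *fasta_files, out_2bit]


if __name__ == "__main__":
    # Hypothetical paths; adjust them to your own layout.
    print(" ".join(fa_to_twobit_command("hg19", "hg19.2bit")))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once faToTwoBit is on your PATH.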


HOW TO USE IT
	SplicePL.pm is designed to run on a LINUX server with large memory and multiple CPUs. SplicePL has been 
        tested on Ubuntu 10.04 and Red Hat 4.1.2-46.

	It requires 3GB of RAM to load the hg19 assembly into memory. To run multiple BLAT processes (n=N), you need 
        at least 4.0*N GB of RAM. 
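
The 4.0*N GB rule can be turned into a quick check of how many BLAT processes a server can support. The sketch
below is a hypothetical helper, not part of SplicePL; it caps the process count by both RAM and CPU count.

```python
import math

GB_PER_BLAT_PROCESS = 4.0  # from the rule above: N processes need 4.0*N GB


def max_blat_processes(total_ram_gb: float, n_cpus: int) -> int:
    """Largest N such that 4.0*N GB fits in RAM and N does not exceed the CPU count."""
    by_ram = math.floor(total_ram_gb / GB_PER_BLAT_PROCESS)
    return max(0, min(by_ram, n_cpus))


if __name__ == "__main__":
    # Example: a server with 24 GB of RAM and 8 CPUs.
    print(max_blat_processes(24, 8))
```

The result is a candidate value for the --processes option. It is an upper bound only, since other jobs on the
server also need memory.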
 

	Run SplicePL.pm from a LINUX terminal in one of two ways: 
	1. (Recommended) Use the BLAT program, which uses about 4.0GB of RAM for the human genome hg19, including 
           the index (1.3GB) plus the sequences (3GB):
       perl SplicePL/SplicePL.pm  --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked \
       --genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=6
	
	2. (Optional, not recommended) Use the gfServer/gfClient programs, which use 1.2GB of RAM for the human 
           genome hg19. In the alignment stage, the needed sequences are loaded into RAM:
	perl SplicePL/SplicePL.pm  --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked \
        --genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=14 --gfServer
	
OUTPUT
	The alignment output is stored in the 'temp' directory; the junction files are in the 'output' directory. 
	
	The first column in the junction file is the splice junction. It has three parts:
	1. chromosome name
	2. last nucleotide of the donor exon
	3. first nucleotide of the acceptor exon
	Coordinates are 1-based, that is, the first base of a chromosome is position 1. For example, a junction 
        may look like    chr10:102286311-102286732
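
For downstream scripting, a junction string in this format can be parsed as in the Python sketch below. The
function names are hypothetical and the sketch assumes the first coordinate is the smaller one, as in the example
above. The BED-style interval covers the intron only; it is one possible convention and is not necessarily what
scripts/junction_to_bed.pl emits.

```python
import re
from typing import NamedTuple, Tuple


class Junction(NamedTuple):
    chrom: str
    donor_end: int       # last nucleotide of the donor exon (1-based)
    acceptor_start: int  # first nucleotide of the acceptor exon (1-based)


_JUNCTION_RE = re.compile(r"^(chr[^:]+):(\d+)-(\d+)$")


def parse_junction(text: str) -> Junction:
    """Parse a 'chrom:donor-acceptor' junction string."""
    m = _JUNCTION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a junction string: {text!r}")
    return Junction(m.group(1), int(m.group(2)), int(m.group(3)))


def intron_length(j: Junction) -> int:
    """Number of bases strictly between the two exon boundary positions."""
    return j.acceptor_start - j.donor_end - 1


def intron_bed_interval(j: Junction) -> Tuple[str, int, int]:
    """Intron as a 0-based, half-open interval (chrom, start, end)."""
    return (j.chrom, j.donor_end, j.acceptor_start - 1)


if __name__ == "__main__":
    j = parse_junction("chr10:102286311-102286732")
    print(j, intron_length(j), intron_bed_interval(j))
```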
	

ACKNOWLEDGEMENT
	Using BLAT, instead of a short-read aligner such as Bowtie, to align the long reads to the genome was 
        initially suggested by Greg Grant. 

REFERENCE
       Kent, W. J. (2002) BLAT - the BLAST-like alignment tool. Genome Res 12, 656-664.
       Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. 
          Genome Res 11, 1725-1729.

CONTACT
	If you have any comments and questions about SplicePL, please contact [email protected]


TO DO
     1. convert junction format to bed format [Done. Jul 8, 2010, scripts/junction_to_bed.pl ]
     2. filter junctions by read coverage
     3. convert BLAT PSL output to BAM format so that RPKM can be calculated using Cufflinks

