[TOC]
deFuse is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion.
The deFuse algorithm and results from an application to ovarian tumours and sarcomas have been published in PLoS Computational Biology:
deFuse: An Algorithm for Gene Fusion Discovery in Tumor RNA-Seq Data
deFuse has been used to discover gene fusions in tumour samples for the following publications:
14-3-3 fusion oncogenes in high-grade endometrial stromal sarcoma
Whole-genome sequencing identifies genetic alterations in pediatric low-grade gliomas
Recurrent somatic alterations of FGFR1 and NTRK2 in pilocytic astrocytoma
Poly-gene fusion transcripts and chromothripsis in prostate cancer
Identification of recurrent FGFR3 fusion genes in lung cancer through kinome-centred RNA sequencing
TBL1XR1/TP63: a novel recurrent gene fusion in B-cell non-Hodgkin lymphoma
Recurrent mutations in epigenetic regulators, RHOA and FYN kinase in peripheral T cell lymphomas
If you are having trouble running deFuse, please open an issue on the bitbucket issue tracker.
Four of the granulosa cell tumour datasets analysed in the PLoS Comp. Bio. paper have been submitted to the European Genome-Phenome Archive and can be accessed here. A mapping between IDs used in the PLoS Comp. Bio. paper and the New England Journal of Medicine paper is provided below.
N. Engl. J. Med ID | PLoS Comp. Bio. ID |
---|---|
GCT0026 | GRC4 |
GCT0028 | GRC1 |
GCT0077 | GRC2 |
GCT0078 | GRC3 |
The recommended way of installing deFuse is using conda
. Starting with version v0.8.0
, OSX and Linux versions will be provided on my anaconda channel. The Linux version is compiled using the bioconda docker image for optimal compatibility.
With conda installed, deFuse can be installed by first adding my anaconda channel.
conda config --add channels 'conda-forge'
conda config --add channels https://conda.anaconda.org/dranew
conda config --add channels 'bioconda'
conda config --add channels 'r'
Then install deFuse.
conda install defuse
In the future, deFuse will be added to the bioconda project.
Download and untar/gunzip the deFuse code package to a directory.
To build the deFuse toolset you must have the boost c++ development libraries installed. If they are not installed on your system you can download them from the boost website. A full install of boost is not required. The easiest thing to do is to download the latest boost source tar.gz, extract it, then add the extracted path to the CPLUS_INCLUDE_PATH environment variable (in bash, export CPLUS_INCLUDE_PATH=/boost/directory/:$CPLUS_INCLUDE_PATH
)
To build the deFuse toolset, cd to the tools subdirectory and type make
.
deFuse relies on other publically available tools as part of its pipeline. Some of these tools are not included with the deFuse download. Obtain these tools as detailed below.
Download samtools:
The latest version of samtools can be downloaded from the samtools sourceforge site.
Set the samtools_bin
entry in config.txt
to the fully qualified paths of the samtools
binary.
Download bowtie:
The latest version of bowtie can be downloaded from the bowtie sourceforge site. deFuse has been tested on version 0.12.5
.
Set the bowtie_bin
and bowtie_build_bin
entries in config.txt
to the fully qualified paths of the bowtie
and bowtie-build
binaries.
Download blat and faToTwoBit:
The latest blat tool suite can be downloaded from the ucsc website. Download blat
and faToTwoBit
and set the blat_bin
and fatotwobit_bin
entries in config.txt
to the fully qualified paths of the blat
and faToTwoBit
binaries.
Download GMAP:
The latest version of GMAP can be downloaded from the gmap site. Build with a default configuration. Do not worry about the --with-gmapdb
build flag, deFuse will request a specific directory for the database anyway.
Download R:
The latest version of R can be downloaded from the R project website. Install R and then locate the R and Rscript executables, and set the r_bin
and rscript_bin
entries in config.txt
to the path of those executables.
Install the ada package. Run R, then at the prompt type install.packages("ada")
To build the latest version with conda inside docker run:
bash run_build_conda.sh
To build version 0.8.1 of defuse, use the following command:
docker build ./ --build-arg app_version=0.8.1 -t defuse:0.8.1
Note that the docker build installs inside a container via conda thus a conda package must be built first for the given version.
The reference dataset setup process has been simplified as of deFuse 0.6.0, and deFuse now automatically downloads all required files.
Create a directory for storing the reference dataset and bowtie indices (herein referred to as dataset_directory
).
The defuse_create_ref.pl
script will download the genome and other source files, and build any derivative files including bowtie indices, gmap indices, and 2bit files. Run the following command. Expect this step to take at least 12 hours.
defuse_create_ref.pl -d dataset_directory
The default config file is located at scripts/config.txt
and builds a hg19
reference. To modify the reference or other parameters, copy the config and specify the filename of the modified config on the command line.
defuse_create_ref.pl -d dataset_directory -c myconfig.txt
deFuse now takes only paired fastq as input. To help create paired fastq files, the scripts fq_all2std.pl
, qseq2fastq.pl
are available for export and qseq formats, and the tool bamfastq
is available for bam files.
Important! Using fastqs converted directly from a position sorted bam will cause performance issues. Difficult to align reads will end up in the same alignment jobs, resulting in a slower overall analysis. It is recommended that you sort your bam by read name before converting to fastq.
Running the deFuse pipeline is simple as running a single script. To run the deFuse pipeline, run defuse_run.pl
from the scripts directory with the appropriate command line parameters. The parameters are listed below:
-c, --config
Configuration Filename
-d, --dataset
Directory of reference genome indices
-1, --1fastq
Fastq filename 1
-2, --2fastq
Fastq filename 2
-o, --output
Output Directory
-r, --res
Main results filename (default: results.tsv in Output Directory)
-a, --rescla
Results with a probability column filename (default: results.classify.tsv in Output Directory)
-b, --resfil
Filtered by the probability threshold results filename (default: results.filtered.tsv in Output Directory)
-n, --name
Library Name
-l, --local
Job Local Directory (default: Output Directory)
-s, --submit
Submitter Type (default: direct)
-p, --parallel
Maximum Number of Parallel Jobs (default: 1)
In the simplest case, deFuse can be run as follows:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir
If you have modified the config during building of the reference genome and indices, specify your config on the command line of defuse_run.pl also.
run_defuse.pl -c myconfig.txt -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir
With the above parameters, deFuse will use reads from the files reads1.fq
and reads2.fq
the output will be in the output_dir
directory.
note: the output directory should be different from the directory containing reads1.fq
and reads2.fq
The above example will not be the fastest way to run deFuse. Given a machine with multiple processes, 8 for example, run deFuse as follows:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -p 8
If you want to specify output results' files:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -r output_dir/myresults.tsv -a path/to/my/dir/myresults_cl.tsv -b path/to/my/dir/myresults_fil.tsv
If you have access to a cluster, you may be able to run deFuse as follows for a sun grid engine (SGE) cluster:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -s sge
or as follows for a portable batch system (PBS) cluster:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -s pbs
or as follows for a LSF cluster:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -s lsf
In many cases it is beneficial to store intermediate results on a local disk rather than a network share. This can be done using the --local
command line parameters as follows:
run_defuse.pl -d dataset_directory -1 reads1.fq -2 reads2.fq -o output_dir -s lsf -l /localdisk
to specify that intermediate files be stored on at /localdisk.
The output directory specified on the command line of run_defuse.pl
will contain the files results.tsv
, results.filtered.tsv
, and results.classify.tsv
. All three files have the same format, though results.classify.tsv
has a probability column from the application of the classifier to results.tsv
, and results.filtered.tsv
has been filtered according to the threshold probability as set in config.tsv
. The file format is tab delimited with one prediction per line, and the following fields per prediction.
- cluster_id : random identifier assigned to each prediction
- library_name : library name given on the command line of deFuse
- gene1 : ensembl id of gene 1
- gene2 : ensembl id of gene 2
- transcript1 : transcript id of gene 1, or 'NA' for non-exonic fusions
- transcript2 : transcript id of gene 2, or 'NA' for non-exonic fusions
- gene_name1 : name of gene 1
- gene_name2 : name of gene 2
- break_predict : breakpoint prediction method, denovo or splitr, that is considered most reliable
- concordant_ratio : proportion of spanning reads considered concordant by blat
- denovo_min_count : minimum kmer count across denovo assembled sequence
- denovo_sequence : fusion sequence predicted by debruijn based denovo sequence assembly
- denovo_span_pvalue : p-value, lower values are evidence the prediction is a false positive
- gene_align_strand1 : alignment strand for spanning read alignments to gene 1
- gene_align_strand2 : alignment strand for spanning read alignments to gene 2
- min_map_count : minimum of the number of genomic mappings for each spanning read
- max_map_count : maximum of the number of genomic mappings for each spanning read
- mean_map_count : average of the number of genomic mappings for each spanning read
- num_multi_map : number of spanning reads that map to more than one genomic location
- span_count : number of spanning reads supporting the fusion
- span_coverage1 : coverage of spanning reads aligned to gene 1 as a proportion of expected coverage
- span_coverage2 : coverage of spanning reads aligned to gene 2 as a proportion of expected coverage
- span_coverage_min : minimum of span_coverage1 and span_coverage2
- span_coverage_max : maximum of span_coverage1 and span_coverage2
- splitr_count : number of split reads supporting the prediction
- splitr_min_pvalue : p-value, lower values are evidence the prediction is a false positive
- splitr_pos_pvalue : p-value, lower values are evidence the prediction is a false positive
- splitr_sequence : fusion sequence predicted by split reads
- splitr_span_pvalue : p-value, lower values are evidence the prediction is a false positive
- adjacent : fusion between adjacent genes
- altsplice : fusion likely the product of alternative splicing between adjacent genes
- break_adj_entropy1 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 1
- break_adj_entropy2 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 2
- break_adj_entropy_min : minimum of break_adj_entropy1 and break_adj_entropy2
- breakpoint_homology : number of nucleotides at the fusion splice that align equally well to gene 1 or gene 2
- breakseqs_estislands_percident : maximum percent identity of fusion sequence alignments to est islands
- cdna_breakseqs_percident : maximum percent identity of fusion sequence alignments to cdna
- deletion : fusion produced by a genomic deletion
- est_breakseqs_percident : maximum percent identity of fusion sequence alignments to est
- eversion : fusion produced by a genomic eversion
- exonboundaries : fusion splice at exon boundaries
- expression1 : expression of gene 1 as number of concordant pairs aligned to exons
- expression2 : expression of gene 2 as number of concordant pairs aligned to exons
- gene_chromosome1 : chromosome of gene 1
- gene_chromosome2 : chromosome of gene 2
- gene_end1 : end position for gene 1
- gene_end2 : end position for gene 2
- gene_location1 : location of breakpoint in gene 1
- gene_location2 : location of breakpoint in gene 2
- gene_start1 : start of gene 1
- gene_start2 : start of gene 2
- gene_strand1 : strand of gene 1
- gene_strand2 : strand of gene 2
- genome_breakseqs_percident : maximum percent identity of fusion sequence alignments to genome
- genomic_break_pos1 : genomic position in gene 1 of fusion splice / breakpoint
- genomic_break_pos2 : genomic position in gene 2 of fusion splice / breakpoint
- genomic_strand1 : genomic strand in gene 1 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream
- genomic_strand2 : genomic strand in gene 2 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream
- genomic_starts1 : comma separated list of starts of fusion sequence alignment in gene 1
- genomic_starts2 : comma separated list of starts of fusion sequence alignment in gene 2
- genomic_ends1 : comma separated list of ends of fusion sequence alignment in gene 1
- genomic_ends2 : comma separated list of ends of fusion sequence alignment in gene 2
- interchromosomal : fusion produced by an interchromosomal translocation
- interrupted_index1 : ratio of coverage before and after the fusion splice / breakpoint in gene 1
- interrupted_index2 : ratio of coverage before and after the fusion splice / breakpoint in gene 2
- inversion : fusion produced by genomic inversion
- orf : fusion combines genes in a way that preserves a reading frame
- probability : probability produced by classification using adaboost and example positives/negatives (only given in results.classified.tsv)
- read_through : fusion involving adjacent potentially resulting from co-transcription rather than genome rearrangement
- repeat_proportion1 : proportion of the spanning reads in gene 1 that span a repeat region
- repeat_proportion2 : proportion of the spanning reads in gene 2 that span a repeat region
- max_repeat_proportion : max of repeat_proportion1 and repeat_proportion2
- splice_score : number of nucleotides similar to GTAG at fusion splice
- num_splice_variants : number of potential splice variants for this gene pair
- splicing_index1 : number of concordant pairs in gene 1 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 2
- splicing_index2 : number of concordant pairs in gene 2 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 1
Supporting reads for a given fusion prediction can be obtained using the get_reads.pl
script. For a prediction with cluster_id=123
run the following command to obtain the spanning and split reads supporting the fusion.
get_reads.pl -c config.txt -o output_dir -i 123
Split reads are output first as demarcated by the heading Split Reads:
. The third line of the output corresponds to the approximate sequence of the fusion boundary to which potential split reads are aligned. Subsequent lines alternate between showing a read id and split alignment.
Spanning reads are output second as demarcated by the heading Spanning Reads:
. The spanning read information is tab separated and provides information about the alignments of supporting spanning reads. The fields are:
- read id
- encoded read end (0 -> 1, 1 -> 2)
- chromosome of alignment
- strand of alignment
- start of alignment
- end of alignment
To obtain the original sequence for spanning reads, search the reads.1.fastq file and reads.2.fastq file in the output directory for the fastq entry with the associated read id.
You cannot run get_reads.pl if you have removed the temporary files from the output folder. The following files are required:
- concordant.read.stats
- clusters.sc
- reads.split.catalog
- job/reads.*.improper.sam
- job/reads.*.spanning.filelist
- job/reads..spanning/