Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom gtf file #241

Closed
sotaro-kanematsu opened this issue Jun 11, 2024 · 7 comments
Closed

custom gtf file #241

sotaro-kanematsu opened this issue Jun 11, 2024 · 7 comments

Comments

@sotaro-kanematsu
Copy link

When searching for fusion genes using Arriba with a custom GTF file that extracts only the representative Ensembl ID transcripts, a large number of false positives (mostly transcripts fused at intergenic regions) occur. Why is this happening?

@sotaro-kanematsu
Copy link
Author

The log from performing Arriba analysis using the original GENCODE19.gtf file would be as follows:
[2024-06-11T07:01:50] Launching Arriba 2.3.0
[2024-06-11T07:01:50] Loading assembly from '/run/media/skanematsu/kanematsu/RNASeq/arriba/hs37d5viral.fa'
[2024-06-11T07:02:13] Loading annotation from '/run/media/skanematsu/kanematsu/RNASeq/arriba/GENCODE19.gtf'
[2024-06-11T07:02:19] Reading chimeric alignments from 'RS10182N.Aligned.sortedByCoord.out.bam' (total=WARNING: 10994 SAM records were malformed and ignored
1247696)
[2024-06-11T07:05:45] Marking multi-mapping alignments (marked=511790)
[2024-06-11T07:05:46] Detecting strandedness (reverse)
[2024-06-11T07:05:46] Assigning strands to alignments
[2024-06-11T07:05:46] Annotating alignments
[2024-06-11T07:05:53] Filtering duplicates (remaining=824639)
[2024-06-11T07:05:54] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_) (remaining=774944)
[2024-06-11T07:05:55] Filtering mates which only map to viral contigs (AC_
NC_*) (remaining=774573)
[2024-06-11T07:05:55] Filtering viral contigs with expression lower than the top 5 (remaining=733124)
[2024-06-11T07:05:56] Filtering viral contigs with less than 5% coverage (remaining=731419)
[2024-06-11T07:05:56] Estimating fragment length (mate gap mean=-45.8894, mate gap stddev=65.354, read length mean=144.576)
[2024-06-11T07:05:57] Filtering read-through fragments with a distance <=10000bp (remaining=689076)
[2024-06-11T07:05:57] Filtering inconsistently clipped mates (remaining=663971)
[2024-06-11T07:05:57] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=561967)
[2024-06-11T07:05:58] Filtering fragments with small insert size (remaining=561105)
[2024-06-11T07:05:58] Filtering alignments with long gaps (remaining=561105)
[2024-06-11T07:05:59] Filtering fragments with both mates in the same gene (remaining=560669)
[2024-06-11T07:05:59] Filtering fusions arising from hairpin structures (remaining=502984)
[2024-06-11T07:05:59] Filtering reads with a mismatch p-value <=0.01 (remaining=254636)
[2024-06-11T07:06:03] Filtering reads with low entropy (k-mer content >=60%) (remaining=251227)
[2024-06-11T07:06:05] Finding fusions and counting supporting reads (total=WARNING: some fusions were subsampled, because they have more than 300 supporting reads
226495)
[2024-06-11T07:06:12] Merging adjacent fusion breakpoints (remaining=225892)
[2024-06-11T07:06:12] Filtering multi-mapping fusions by alignment score and read support (remaining=197815)
[2024-06-11T07:06:19] Estimating expected number of fusions by random chance (e-value)
[2024-06-11T07:06:21] Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=195545)
[2024-06-11T07:06:21] Filtering intragenic fusions with both breakpoints in exonic regions (remaining=182346)
[2024-06-11T07:06:21] Filtering fusions with <2 supporting reads (remaining=71778)
[2024-06-11T07:06:21] Filtering fusions with an e-value >=0.3 (remaining=62904)
[2024-06-11T07:06:21] Searching for internal tandem duplications <=100bp with >=10 supporting reads and >=7% allele fraction (remaining=62923)
[2024-06-11T07:06:21] Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=62694)
[2024-06-11T07:06:21] Searching for known fusions in '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz' (remaining=62695)
[2024-06-11T07:06:22] Filtering in vitro-generated fusions between genes with an expression above the 99.8% quantile (remaining=4691)
[2024-06-11T07:06:25] Searching for fusions with spliced split reads (remaining=4766)
[2024-06-11T07:06:27] Selecting best breakpoints from genes with multiple breakpoints (remaining=1095)
[2024-06-11T07:06:27] Filtering read-through fusions with breakpoints near the gene boundary (remaining=996)
[2024-06-11T07:06:27] Searching for fusions with >=4 spliced events (remaining=1046)
[2024-06-11T07:06:27] Filtering blacklisted fusions in '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/blacklist_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz' (remaining=153)
[2024-06-11T07:06:48] Filtering fusions with anchors <=23nt (remaining=129)
[2024-06-11T07:06:48] Filtering end-to-end fusions with low support (remaining=125)
[2024-06-11T07:06:48] Filtering fusions with no coverage around the breakpoints (remaining=111)
[2024-06-11T07:06:49] Indexing gene sequences
[2024-06-11T07:06:51] Filtering genes with >=30% identity (remaining=66)
[2024-06-11T07:06:51] Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=39)
[2024-06-11T07:06:51] Selecting best breakpoints from genes with multiple breakpoints (remaining=39)
[2024-06-11T07:06:51] Searching for additional isoforms (remaining=43)
[2024-06-11T07:06:51] Assigning confidence scores to events
[2024-06-11T07:06:52] Loading tags from '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz'
[2024-06-11T07:06:52] Loading protein domains from '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3'
WARNING: unknown gene: RP11-145E5.5 ENSG00000264545
[2024-06-11T07:06:52] Writing fusions to file '/run/media/skanematsu/kanematsu/RNASeq/230314_A01054_0208_AH23YNDSX7/arriba_result_3/RS10182N.fusions.tsv'
WARNING: encountered early stop codon in transcript ENST00000476412.1 at amino acid 2 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript ENST00000476412.1 at amino acid 2 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript ENST00000476412.1 at amino acid 2 (error in GTF file?) => predicted peptide sequence may be wrong
[2024-06-11T07:06:53] Writing discarded fusions to file '/run/media/skanematsu/kanematsu/RNASeq/230314_A01054_0208_AH23YNDSX7/arriba_result_3/RS10182N.fusions.discarded.tsv'
[2024-06-11T07:07:05] Freeing resources
[2024-06-11T07:07:11] Done (elapsed time=00:05:21, CPU time=00:05:20, peak memory=6.06gb)

On the other hand, when performing Arriba analysis using a custom GTF file, the log would look like this:
[2024-06-11T06:53:33] Launching Arriba 2.3.0
[2024-06-11T06:53:33] Loading assembly from '/run/media/skanematsu/kanematsu/RNASeq/arriba/hs37d5viral.fa'
[2024-06-11T06:53:55] Loading annotation from '/run/media/skanematsu/kanematsu/RNASeq/arriba/filtered_GENCODE19.gtf'
[2024-06-11T06:53:57] Reading chimeric alignments from 'RS10182N.Aligned.sortedByCoord.out.bam' (total=WARNING: 10994 SAM records were malformed and ignored
1426896)
[2024-06-11T06:57:11] Marking multi-mapping alignments (marked=519289)
[2024-06-11T06:57:12] Detecting strandedness (reverse)
[2024-06-11T06:57:12] Assigning strands to alignments
[2024-06-11T06:57:13] Annotating alignments
[2024-06-11T06:57:19] Filtering duplicates (remaining=944846)
[2024-06-11T06:57:20] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_) (remaining=895151)
[2024-06-11T06:57:21] Filtering mates which only map to viral contigs (AC_
NC_*) (remaining=894780)
[2024-06-11T06:57:21] Filtering viral contigs with expression lower than the top 5 (remaining=853331)
[2024-06-11T06:57:22] Filtering viral contigs with less than 5% coverage (remaining=851626)
[2024-06-11T06:57:23] Estimating fragment length (mate gap mean=-46.5587, mate gap stddev=63.5323, read length mean=144.618)
[2024-06-11T06:57:23] Filtering read-through fragments with a distance <=10000bp (remaining=714135)
[2024-06-11T06:57:23] Filtering inconsistently clipped mates (remaining=688587)
[2024-06-11T06:57:23] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=586289)
[2024-06-11T06:57:24] Filtering fragments with small insert size (remaining=585427)
[2024-06-11T06:57:25] Filtering alignments with long gaps (remaining=585427)
[2024-06-11T06:57:25] Filtering fragments with both mates in the same gene (remaining=585009)
[2024-06-11T06:57:25] Filtering fusions arising from hairpin structures (remaining=527333)
[2024-06-11T06:57:26] Filtering reads with a mismatch p-value <=0.01 (remaining=277527)
[2024-06-11T06:57:29] Filtering reads with low entropy (k-mer content >=60%) (remaining=273016)
[2024-06-11T06:57:32] Finding fusions and counting supporting reads (total=WARNING: some fusions were subsampled, because they have more than 300 supporting reads
227255)
[2024-06-11T06:57:40] Merging adjacent fusion breakpoints (remaining=226656)
[2024-06-11T06:57:41] Filtering multi-mapping fusions by alignment score and read support (remaining=199647)
[2024-06-11T06:57:49] Estimating expected number of fusions by random chance (e-value)
[2024-06-11T06:57:50] Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=196360)
[2024-06-11T06:57:50] Filtering intragenic fusions with both breakpoints in exonic regions (remaining=184402)
[2024-06-11T06:57:51] Filtering fusions with <2 supporting reads (remaining=74880)
[2024-06-11T06:57:51] Filtering fusions with an e-value >=0.3 (remaining=64578)
[2024-06-11T06:57:51] Searching for internal tandem duplications <=100bp with >=10 supporting reads and >=7% allele fraction (remaining=64597)
[2024-06-11T06:57:51] Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=64258)
[2024-06-11T06:57:51] Searching for known fusions in '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz' (remaining=64259)
[2024-06-11T06:57:52] Filtering in vitro-generated fusions between genes with an expression above the 99.8% quantile (remaining=5690)
[2024-06-11T06:57:56] Searching for fusions with spliced split reads (remaining=5753)
[2024-06-11T06:57:58] Selecting best breakpoints from genes with multiple breakpoints (remaining=1317)
[2024-06-11T06:57:58] Filtering read-through fusions with breakpoints near the gene boundary (remaining=1222)
[2024-06-11T06:57:58] Searching for fusions with >=4 spliced events (remaining=1257)
[2024-06-11T06:57:58] Filtering blacklisted fusions in '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/blacklist_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz' (remaining=396)
[2024-06-11T06:58:19] Filtering fusions with anchors <=23nt (remaining=374)
[2024-06-11T06:58:19] Filtering end-to-end fusions with low support (remaining=346)
[2024-06-11T06:58:19] Filtering fusions with no coverage around the breakpoints (remaining=309)
[2024-06-11T06:58:19] Indexing gene sequences
[2024-06-11T06:58:25] Filtering genes with >=30% identity (remaining=272)
[2024-06-11T06:58:25] Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=254)
[2024-06-11T06:58:27] Selecting best breakpoints from genes with multiple breakpoints (remaining=254)
[2024-06-11T06:58:27] Searching for additional isoforms (remaining=258)
[2024-06-11T06:58:27] Assigning confidence scores to events
[2024-06-11T06:58:28] Loading tags from '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/known_fusions_hg19_hs37d5_GRCh37_v2.3.0.tsv.gz'
[2024-06-11T06:58:28] Loading protein domains from '/run/media/skanematsu/kanematsu/RNASeq/arriba/database/protein_domains_hg19_hs37d5_GRCh37_v2.3.0.gff3'
WARNING: unknown gene: RP11-145E5.5 ENSG00000264545
[2024-06-11T06:58:28] Writing fusions to file '/run/media/skanematsu/kanematsu/RNASeq/230314_A01054_0208_AH23YNDSX7/arriba_result_2/RS10182N.fusions.tsv'
[2024-06-11T06:58:34] Writing discarded fusions to file '/run/media/skanematsu/kanematsu/RNASeq/230314_A01054_0208_AH23YNDSX7/arriba_result_2/RS10182N.fusions.discarded.tsv'
[2024-06-11T06:58:46] Freeing resources
[2024-06-11T06:58:53] Done (elapsed time=00:05:20, CPU time=00:05:18, peak memory=6.1gb)

It seems there is a significant difference in the number of remaining fusion genes after filtering with the blacklist.

@suhrig
Copy link
Owner

suhrig commented Jun 11, 2024

This is because Arriba ignores all fusion candidates that are explained by normal splicing. If you remove transcripts from the annotation, then Arriba may be misled into thinking that reads originating from these normal transcripts are fusions (read-through fusions to be precise). It has nothing to do with the blacklist. Note how the fusion candidates are already higher before the blacklist step.

There is no easy way to force Arriba to use canonical transcripts for its annotation. This option is only available for draw_fusions.R. At best, you could add the splice junctions from the transcripts you removed to the blacklist.

@sotaro-kanematsu
Copy link
Author

sotaro-kanematsu commented Jun 13, 2024

Thanks so much for getting back to me!
in short, the step filtering read-through fragments with a distance <=10000bp is incomplete?

@suhrig
Copy link
Owner

suhrig commented Jun 13, 2024

Yes, there are many transcripts with introns >10k. You can increase the value using the parameter -R. Beware that large values run the risk of missing fusions from focal deletions. For example, if you set it to 500kb, then Arriba will omit fusions arising from chromosomal deletions of 500kb or less. There are a few known examples of relevant fusions in this range.

@sotaro-kanematsu
Copy link
Author

Thank you, Suhrig!
I am interested in the following points:
How does Arriba distinguish whether a splicing event is a normal molecular mechanism or not? How does using multiple transcript isoforms help exclude these normal splicing events? Additionally, how does the use of multiple transcript isoforms shorten the length of read-through type splicing events? If possible, it would be helpful if you could point out the part of the script that determines this.

@suhrig
Copy link
Owner

suhrig commented Jun 18, 2024

How does Arriba distinguish whether a splicing event is a normal molecular mechanism or not?

Every splice junction mentioned in the GTF file is considered normal.

How does using multiple transcript isoforms help exclude these normal splicing events?

The more comprehensive the annotation is, the more isoforms there are, and the more likely it is that the transcript is considered known/normal by Arriba.

how does the use of multiple transcript isoforms shorten the length of read-through type splicing events?

Let's say a gene has two transcripts: a long one and a short one, with the shorter fully encapsulated by the longer one. If you remove the long one from your GTF file, then Arriba will consider any read from the long transcript as a fusion candidate. In fact, the main issue is that by removing transcripts, you shrink the size of the gene. Any reads protuding over the boundaries of the shrunk gene will be considered a fusion candidate. A hacky workaround would be to artificially expand the boundaries of your transcript of interest to the maximum span of all transcripts combined. But that's hardly better than simply increasing the value of -R.

If you could point out the part of the script that determines this

This is the function which extracts reads which may be candidates of read-through fusions or fusions from focal deletions (or in other words, it discards reads explained by annotated transcripts):

bool extract_read_through_alignment(chimeric_alignments_t& chimeric_alignments, const string& read_name, bam1_t* forward_mate, bam1_t* reverse_mate, const gene_annotation_index_t& gene_annotation_index, const bool separate_chimeric_bam_file) {

This is the function which implements the parameter -R:

unsigned int filter_proximal_read_through(chimeric_alignments_t& chimeric_alignments, const int min_distance) {

@sotaro-kanematsu
Copy link
Author

Thanks, Suhrig. It's perfect.
I understood the draft for selecting fusion transcripts in arriba.

@suhrig suhrig closed this as completed Jul 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants