custom gtf file #241
The log from running Arriba with the original GENCODE19.gtf file is as follows:
When running Arriba with a custom GTF file, the log looks like this:
There seems to be a significant difference in the number of fusion candidates remaining after filtering with the blacklist.
This is because Arriba ignores all fusion candidates that are explained by normal splicing. If you remove transcripts from the annotation, then Arriba may be misled into thinking that reads originating from these normal transcripts are fusions (read-through fusions to be precise). It has nothing to do with the blacklist. Note how the fusion candidates are already higher before the blacklist step. There is no easy way to force Arriba to use canonical transcripts for its annotation. This option is only available for |
Thanks so much for getting back to me! |
Yes, there are many transcripts with introns >10k. You can increase the value using the parameter |
Thank you, Suhrig! |
Every splice junction mentioned in the GTF file is considered normal.
The more comprehensive the annotation is, the more isoforms there are, and the more likely it is that the transcript is considered known/normal by Arriba.
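To make the point about annotation completeness concrete, here is a minimal sketch (not Arriba's actual code) that derives the set of splice junctions implied by the exon records of a GTF file and shows which junctions stop being "known" once a transcript is removed. The helper names `exon_line` and `splice_junctions`, the transcript IDs, and all coordinates are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def exon_line(transcript_id, start, end, chrom="chr1"):
    """Build a minimal GTF exon record (hypothetical toy data)."""
    attrs = f'gene_id "GENE_A"; transcript_id "{transcript_id}";'
    return "\t".join([chrom, "toy", "exon", str(start), str(end), ".", "+", ".", attrs])


def splice_junctions(gtf_lines):
    """Return the set of (chrom, intron_start, intron_end) implied by exon records."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[(fields[0], match.group(1))].append((int(fields[3]), int(fields[4])))

    junctions = set()
    for (chrom, _transcript), spans in exons.items():
        spans.sort()
        # Each pair of consecutive exons defines one intron, i.e. one junction.
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            junctions.add((chrom, left_end + 1, right_start - 1))
    return junctions


# Full annotation: a long and a short transcript of the same gene.
FULL = [
    exon_line("T_long", 100, 200),
    exon_line("T_long", 300, 400),
    exon_line("T_long", 900, 1000),
    exon_line("T_short", 300, 400),
    exon_line("T_short", 900, 1000),
]
# Reduced annotation: only the "representative" short transcript is kept.
REDUCED = [line for line in FULL if 'transcript_id "T_short"' in line]

# Junctions that are annotated in the full GTF but unknown in the reduced one.
lost = splice_junctions(FULL) - splice_junctions(REDUCED)
print(sorted(lost))
```

Reads spanning any junction in `lost` would be explained by normal splicing under the full annotation, but not under the reduced one.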
Let's say a gene has two transcripts: a long one and a short one, with the shorter fully encapsulated by the longer one. If you remove the long one from your GTF file, then Arriba will consider any read from the long transcript as a fusion candidate. In fact, the main issue is that by removing transcripts, you shrink the size of the gene. Any reads protruding over the boundaries of the shrunk gene will be considered a fusion candidate. A hacky workaround would be to artificially expand the boundaries of your transcript of interest to the maximum span of all transcripts combined. But that's hardly better than simply increasing the value of
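The boundary effect can be illustrated with a toy model. This is only a sketch under simplified assumptions, not how Arriba is implemented: a gene's span is taken as the union of its transcripts' spans, and a read counts as "outside" if it extends beyond that span. The function names, the gene and transcript IDs, and the coordinates are all hypothetical.

```python
def gene_spans(transcripts):
    """Map each gene to the (min start, max end) over its transcripts."""
    spans = {}
    for gene, start, end in transcripts.values():
        lo, hi = spans.get(gene, (start, end))
        spans[gene] = (min(lo, start), max(hi, end))
    return spans


def protrudes(read_start, read_end, span):
    """True if the read extends beyond the gene boundaries."""
    return read_start < span[0] or read_end > span[1]


# transcript_id -> (gene_id, start, end); the short transcript lies inside the long one.
full = {
    "T_long": ("GENE_A", 1000, 9000),
    "T_short": ("GENE_A", 3000, 6000),
}
reduced = {"T_short": full["T_short"]}

read = (1500, 1600)  # a read originating from the long transcript

inside_full = not protrudes(*read, gene_spans(full)["GENE_A"])
outside_reduced = protrudes(*read, gene_spans(reduced)["GENE_A"])

# The hacky workaround: stretch the kept transcript to the span of all transcripts.
full_span = gene_spans(full)["GENE_A"]
expanded = {"T_short": ("GENE_A", full_span[0], full_span[1])}
inside_expanded = not protrudes(*read, gene_spans(expanded)["GENE_A"])

print(inside_full, outside_reduced, inside_expanded)
```

In this toy model the same read lies within the gene under the full annotation, falls outside it once the long transcript is removed, and is back inside after the kept transcript is expanded.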
This is the function which extracts reads which may be candidates of read-through fusions or fusions from focal deletions (or in other words, it discards reads explained by annotated transcripts):
This is the function which implements the parameter
Thanks, Suhrig. It's perfect. |
When searching for fusion genes using Arriba with a custom GTF file that extracts only the representative Ensembl ID transcripts, a large number of false positives (mostly transcripts fused at intergenic regions) occur. Why is this happening?