Use apparent transcript length rather than actual transcript length (feature request) #8
Comments
Maybe I'm missing something, but it seems like the current code for calculating TPM does incorporate the EffectiveLength measure. Does EffectiveLength not take into account the portion of the transcript that is mapped to?
I think that @sjackman is referring to something even more "subtle" than effective length. The effective length accounts for the ability of all locations on a transcript to generate fragments (according to e.g. the fragment length distribution and, when modeled, different biases). However, I think what @sjackman is referring to is more akin to simultaneous abundance estimation and "transcript modification". For example, suppose I have a transcript sequence in my fasta that is 5kb long and highly expressed, but I never see any reads mapping to the last 1kb. In this case, perhaps what is actually expressed is a variant of that transcript, rather than the sequence in the fasta file. You could also imagine situations like this coming up in de novo assemblies, where some portions of the assembled contigs are not covered while others have high coverage, leading a human observer to posit that perhaps there's a mis-assembly. Could something like this be taken into account? Perhaps, but you can imagine why this might become very tricky.
My particular use case was a gene that in reality has two exons and one intron, where the intron is 90% of the length of the gene. The annotation missed the intron, so the annotated transcript appeared 10x longer than it is in truth.
@rob-p Thanks for that clarification. I actually thought the EffectiveLength measure accounted for this. I guess the situation does become tricky, but maybe the position-specific start distribution data could be helpful in constructing a "baseline" profile of transcript coverage? Comparing each transcript's coverage vector against this baseline would then give you a scaling factor to incorporate in the EffectiveLength calculation.
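To make the suggestion above concrete, here is a minimal sketch of one possible reading of it. Everything in it is an assumption for illustration: the function name `coverage_scaling_factor`, the `min_depth` threshold, and the idea of representing both the baseline profile and the transcript's coverage as equal-length lists of per-bin values. It is not how EffectiveLength is actually computed.

```python
def coverage_scaling_factor(coverage, baseline, min_depth=1):
    """Fraction of the baseline profile's mass that falls in bins the
    transcript actually covers (hypothetical definition).

    coverage: observed read depth per bin for one transcript
    baseline: expected relative coverage per bin (same binning)
    """
    if len(coverage) != len(baseline):
        raise ValueError("coverage and baseline must use the same binning")
    total = sum(baseline)
    if total <= 0:
        raise ValueError("baseline profile must have positive total mass")
    # Sum the expected mass only over bins with enough observed depth.
    covered = sum(b for d, b in zip(coverage, baseline) if d >= min_depth)
    return covered / total


# Toy example: a flat baseline over 10 bins; reads cover only the first 3.
baseline = [1.0] * 10
coverage = [4, 5, 3, 0, 0, 0, 0, 0, 0, 0]
factor = coverage_scaling_factor(coverage, baseline)

# A scaled length could then be: effective_length * factor
scaled_length = 5000 * factor
```

The threshold `min_depth` is exactly the kind of arbitrary choice that makes this hard to define in general.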
@mdshw5, I certainly think that this information could be useful (and bias terms are taken into account when computing the effective length, when bias modeling is enabled). The problem is that the position-specific start distribution is learned globally (well, conditioned on a few different length classes), rather than being transcript specific. So it's not clear how much it would help in Shaun's case, since this is a particular transcript where a splicing variation is causing a huge portion of the transcript to have no mapped reads. Unless this happens in many transcripts (globally), this particular transcript's contribution to the global position-specific start distribution will likely be rather small.
When calculating the TPM, it may be an idea to use the length of the transcript that has reads mapped to it rather than the FASTA length of the transcript. Defining "the length of the transcript that has reads mapped to it" may be difficult, or may require choosing arbitrary thresholds to decide what portion of the transcript is transcribed.
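A rough sketch of the difference this would make, under simplifying assumptions: TPM is computed here from raw lengths rather than from EffectiveLength, and "apparent length" is taken to be the number of positions with at least `min_depth` reads. The helper names `tpm` and `apparent_length` and the toy numbers are made up for illustration.

```python
def tpm(counts, lengths):
    """Transcripts per million: divide each count by its length,
    then rescale so the values sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


def apparent_length(coverage, min_depth=1):
    """Number of positions with depth >= min_depth (an arbitrary threshold)."""
    return sum(1 for d in coverage if d >= min_depth)


# Transcript A is annotated as 10,000 bases, but reads cover only 1,000.
# Transcript B is 1,000 bases and fully covered. Both have 100 reads.
cov_a = [5] * 1000 + [0] * 9000
cov_b = [5] * 1000
counts = [100, 100]

tpm_annotated = tpm(counts, [len(cov_a), len(cov_b)])
tpm_apparent = tpm(counts, [apparent_length(cov_a), apparent_length(cov_b)])
```

With the annotated lengths, A is estimated at about one tenth the abundance of B; with the apparent lengths the two come out equal. A transcript with no covered positions would have an apparent length of zero, so a real implementation would need to handle that case.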
See https://twitter.com/sjackman/status/620984740150030336