Use apparent transcript length rather than actual transcript length (feature request) #8

Open
sjackman opened this issue Jul 14, 2015 · 5 comments


@sjackman
Contributor

When calculating TPM, it might be worth using the length of the portion of the transcript that has reads mapped to it, rather than the transcript's length in the FASTA file. That said, "the length of the transcript that has reads mapped to it" may be hard to define, or may require choosing arbitrary thresholds to decide which portion of the transcript is actually transcribed.
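To make the request concrete, here is a minimal sketch of the standard TPM formula alongside a naive "apparent length" computed from per-base coverage. The function names and the `min_depth` threshold are illustrative assumptions, not part of this project's code; the threshold is exactly the kind of arbitrary choice mentioned above.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and per-transcript lengths."""
    # Length-normalized rate for each transcript (reads per base).
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    # Rescale so that the values sum to one million.
    return [1e6 * r / total for r in rates]


def covered_length(coverage, min_depth=1):
    """Hypothetical 'apparent length': positions with depth >= min_depth."""
    return sum(1 for depth in coverage if depth >= min_depth)


# Two transcripts with equal read counts but a 10x difference in length.
values = tpm([10, 10], [100, 1000])

# A toy per-base coverage vector; only three positions are covered.
apparent = covered_length([0, 0, 3, 5, 1, 0])
```

Passing `covered_length(...)` instead of the FASTA length as the denominator in `tpm` would be one way to express the idea, with all the caveats about how `min_depth` is chosen.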

See https://twitter.com/sjackman/status/620984740150030336

@mdshw5
Contributor

mdshw5 commented Aug 5, 2016

Maybe I'm missing something, but it seems like the current code for calculating TPM does incorporate the EffectiveLength measure. Does EffectiveLength not take into account the portion of the transcript that is mapped to?

@rob-p
Collaborator

rob-p commented Aug 5, 2016

I think that @sjackman is referring to something even more "subtle" than effective length. The effective length accounts for the ability of all locations on a transcript to generate fragments (according to, e.g., the fragment length distribution and, when modeled, different biases). However, I think what @sjackman is referring to is more akin to simultaneous abundance estimation and "transcript modification". For example, suppose I have a transcript sequence in my FASTA file that is 5kb long and highly expressed, but I never see any reads mapping to the last 1kb. In this case, perhaps what is actually expressed is a variant of that transcript, rather than the sequence in the FASTA file. You could imagine similar situations coming up in de novo assemblies, where some portions of the assembled contigs are not covered while others have high coverage, leading a human observer to posit that perhaps there's a mis-assembly. Could something like this be taken into account? Perhaps, but you can imagine why this might become very tricky.
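The distinction can be sketched in code. Under one common definition (an assumption here, not necessarily the exact formula this tool uses, and ignoring bias terms), the effective length is the expected number of positions at which a fragment could start, averaged over the fragment length distribution. The function name and the toy distribution below are hypothetical.

```python
def effective_length(length, frag_len_probs):
    """Expected number of valid fragment start positions on a transcript.

    frag_len_probs maps fragment length -> probability (hypothetical input).
    """
    # Only fragments that fit on the transcript can be generated from it.
    usable = {f: p for f, p in frag_len_probs.items() if f <= length}
    mass = sum(usable.values())
    if mass == 0:
        # No fragment fits; fall back to the raw length.
        return float(length)
    # A fragment of length f can start at (length - f + 1) positions.
    return sum(p / mass * (length - f + 1) for f, p in usable.items())


# The 5kb transcript from the example above, with a toy distribution.
eff = effective_length(5000, {200: 0.5, 300: 0.5})
```

Note that nothing in this calculation looks at where reads actually mapped: the 5kb transcript gets the same effective length whether or not its last 1kb is covered, which is why effective length alone does not address the request.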

@sjackman
Contributor Author

sjackman commented Aug 5, 2016

My particular use case was a gene that in reality has two exons and one intron, where the intron makes up 90% of the length of the gene. The annotation missed the intron, so the annotated transcript appeared 10x longer than it truly is.
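The effect on the length-normalized rate can be worked through with made-up numbers. The read count and lengths below are hypothetical, chosen only to match the 90% intron described above.

```python
def reads_per_base(reads, length):
    """Length-normalized read rate, the quantity TPM is built from."""
    return reads / length


reads = 500             # hypothetical read count, all falling in the exons
true_length = 1_000     # two exons only
annotated_length = 10_000  # exons plus the unannotated intron (90% of gene)

true_rate = reads_per_base(reads, true_length)
annotated_rate = reads_per_base(reads, annotated_length)

# How many times too low the rate is when the annotated length is used.
fold_underestimate = true_rate / annotated_rate
```

With these numbers the rate is underestimated by a factor of about 10. The TPM itself shifts by roughly the same factor only when this transcript is a small share of the total, since TPM is renormalized across all transcripts.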

@mdshw5
Contributor

mdshw5 commented Aug 5, 2016

@rob-p Thanks for that clarification. I actually thought the EffectiveLength measure accounted for this. I guess the situation does become tricky, but maybe the position-specific start distribution data could be used to construct a "baseline" profile of transcript coverage? Comparing each transcript's coverage vector against this baseline would then give you a scaling factor to incorporate into the EffectiveLength calculation.
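One way this suggestion might look, as a rough sketch only: bin each transcript's coverage by relative position, compare it to a baseline profile, and take the share of baseline mass that falls in bins where the transcript is actually covered. The function names, the number of bins, and the `min_ratio` threshold are all hypothetical and not taken from this project.

```python
def binned_profile(coverage, n_bins=10):
    """Fraction of total depth in each of n_bins relative-position bins."""
    n = len(coverage)
    totals = [0.0] * n_bins
    for i, depth in enumerate(coverage):
        # Map absolute position i to a relative-position bin.
        totals[i * n_bins // n] += depth
    s = sum(totals)
    return [t / s for t in totals] if s else totals


def length_scaling_factor(coverage, baseline, min_ratio=0.1):
    """Share of baseline mass in bins where the transcript has coverage.

    A bin counts as covered if its observed fraction is at least
    min_ratio times the baseline fraction (an arbitrary threshold).
    """
    observed = binned_profile(coverage, len(baseline))
    return sum(b for o, b in zip(observed, baseline) if o >= min_ratio * b)


# A 5kb transcript with uniform depth except for an uncovered last 1kb,
# compared against a flat (uniform) baseline profile.
coverage = [20] * 4000 + [0] * 1000
factor = length_scaling_factor(coverage, [0.1] * 10)
```

Here the factor comes out at about 0.8, which could be multiplied into the effective length. Whether a global baseline is informative enough for this to work on a single transcript is a separate question.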

@rob-p
Collaborator

rob-p commented Aug 5, 2016

@mdshw5, I certainly think that this information could be useful (and bias terms are taken into account when computing the effective length, when bias modeling is enabled). The problem is that the position-specific start distribution is learned globally (well, conditioned on a few different length classes), rather than being transcript-specific. So it's not clear how much it would help in Shaun's case, since this is one particular transcript where a splicing variation is causing a huge portion of the transcript to have no mapped reads. Unless this happens in many transcripts (globally), this particular transcript's contribution to the global position-specific start distribution will likely be rather small.
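A toy illustration of why a single transcript barely moves a globally learned profile. This is a simplification with invented numbers: it pools binned start counts over all transcripts and ignores the length classes mentioned above; the function name is hypothetical.

```python
def pooled_start_profile(per_transcript_bins):
    """Global start-position profile pooled over all transcripts."""
    n_bins = len(per_transcript_bins[0])
    totals = [0] * n_bins
    for bins in per_transcript_bins:
        for i, count in enumerate(bins):
            totals[i] += count
    s = sum(totals)
    return [t / s for t in totals]


# 999 transcripts with uniform read starts, plus one transcript with the
# same number of reads but none starting in its last 20%.
typical = [100] * 10
truncated = [125] * 8 + [0] * 2
profile = pooled_start_profile([typical] * 999 + [truncated])

# Largest deviation of any bin from the flat value of 0.1.
max_shift = max(abs(p - 0.1) for p in profile)
```

In this setup every bin of the pooled profile stays within 0.001 of the flat value, so the global distribution carries almost no signal about the one truncated transcript.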


3 participants