Counts in ONT data #924

Open
lubitelpospat opened this issue Apr 15, 2024 · 1 comment
Comments

@lubitelpospat

Hello dear Salmon developers,
First of all, thank you very much for your effort in supporting Oxford Nanopore reads! I've been using Salmon for quantification of ONT sequencing experiments, and recently I decided to dive deeper into how it produces counts for ONT data. The release notes for the initial ONT support (v1.5.1) state that counts should be 100 for all transcripts because, at the time, it was not clear how EffectiveLength should be computed. However, now (using release v1.10.1) Salmon produces meaningful count estimates. I tried to figure out the algorithm by reading the code, but failed...
Is there a place where this algorithm is documented, or, if not, could you please explain how it is implemented?
Thank you in advance!

@JuliaHolz

Hello! I did some work with the Oxford Nanopore error model last summer. There's a blog post about ONT long-read quantification here: https://combine-lab.github.io/salmon-tutorials/2021/ont-long-read-quantification/ .

In terms of length correction, the --ont flag essentially turns off length correction, since it doesn't really apply to long reads. The error model that the current version of Salmon uses for the --ont flag (found in src/ONTAlignmentModel.cpp) bins reads by length (into 4 bins by default, I believe). For each bin it learns a binomial/geometric distribution for the number of errors (mismatches or indels) in the alignments of reads in that bin, as well as distributions for the number of bases soft-clipped at the beginning and end of the read. It then uses these models to penalize reads whose number of errors/soft-clips is very different from the center of the learned distribution, but only when that number is larger than expected for the bin, since a smaller-than-expected number of errors in an alignment is generally a good sign, not a bad one, for how likely the read is to map to that transcript.

I'm not the original author of this model, so I don't have all the details on the specifics of how it works or the design decisions that went into it, but let me know if you have any other questions I can answer!
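To make the one-sided penalty idea above concrete, here is a minimal C++ sketch. It is not Salmon's actual implementation: the bin count follows the 4-bin default mentioned above, but every name, constant, and the geometric-style decay are assumptions chosen for illustration only.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <iostream>

// Illustrative sketch of a binned, one-sided error penalty.
// All names and constants here are assumptions, not Salmon's API.

constexpr std::size_t kNumBins = 4;       // reads grouped into 4 length bins (default, per the comment above)
constexpr double kMaxReadLen = 100000.0;  // assumed upper bound used only to pick a bin

// Per-bin summary of the learned error distribution.
struct BinModel {
    double meanErrorRate;  // expected fraction of mismatches/indels for reads in this bin
};

// Pick a length bin for a read (simple equal-width binning; an assumption).
std::size_t lengthBin(uint32_t readLen) {
    double frac = std::min(1.0, readLen / kMaxReadLen);
    return std::min(kNumBins - 1, static_cast<std::size_t>(frac * kNumBins));
}

// One-sided penalty: only alignments with MORE errors than expected for the
// bin are down-weighted; fewer errors than expected incur no penalty.
double logAlignmentPenalty(uint32_t readLen, uint32_t numErrors, const BinModel& bin) {
    double observedRate = static_cast<double>(numErrors) / readLen;
    double excess = observedRate - bin.meanErrorRate;
    if (excess <= 0.0) {
        return 0.0;  // at or below the expected error rate: no penalty
    }
    // Geometric-style decay in the excess error count (illustrative choice).
    return -excess * readLen * std::log(2.0);
}

int main() {
    // Assumed per-bin error rates that a real model would learn from the data.
    std::array<BinModel, kNumBins> bins{{{0.10}, {0.09}, {0.08}, {0.07}}};

    uint32_t readLen = 12000;
    for (uint32_t numErrors : {800u, 1200u, 2400u}) {
        const BinModel& bin = bins[lengthBin(readLen)];
        std::cout << "errors=" << numErrors
                  << " log-penalty=" << logAlignmentPenalty(readLen, numErrors, bin) << "\n";
    }
    return 0;
}
```

In this sketch, an alignment with fewer errors than the bin's expectation gets no penalty at all, while the penalty grows with the excess error count, which mirrors the asymmetry described in the comment above.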
