"--gcBias" option for single-end reads? #83

Open

DarwinAwardWinner opened this issue Aug 31, 2016 · 9 comments

@DarwinAwardWinner

Is there any possibility of implementing the --gcBias option for single-end reads? I have a dataset where this would be very useful. I suppose that in order to do this you'd need to assign an end point to each fragment whose start point is represented by a single read. You could either just assume every fragment's length is equal to the given mean fragment length, or maybe use the fragment length distribution implied by the mean and SD to compute, for each nucleotide past the end of the read, the probability that the fragment extends to that nucleotide, and then use that probability as a weight. Obviously this would require warning the user to set the fragment length mean and SD parameters to values appropriate for their data.
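
For concreteness, here's a rough sketch of the per-nucleotide weighting I have in mind (illustrative Python with NumPy/SciPy, not Salmon code; the function name and parameter values are just placeholders):

```python
# Sketch: weight each position past the read's 3' end by the probability that
# the fragment extends at least that far, under a normal FLD with the
# user-supplied mean and SD, truncated to lengths >= the read length.
import numpy as np
from scipy.stats import norm

def extension_weights(read_len, fld_mean, fld_sd, max_len=1000):
    """Return {d: P(fragment length >= d)} for d = read_len .. max_len."""
    lengths = np.arange(read_len, max_len + 1)
    pmf = norm.pdf(lengths, loc=fld_mean, scale=fld_sd)
    pmf /= pmf.sum()                      # normalize over the allowed range
    survival = pmf[::-1].cumsum()[::-1]   # tail sums: P(L >= d)
    return dict(zip(lengths, survival))

w = extension_weights(read_len=50, fld_mean=200, fld_sd=25)
# w[200] is roughly 0.5: about half of the fragments extend 200 bp or more
```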

@DarwinAwardWinner (Author) commented Aug 31, 2016

Side note: does Salmon learn the fragment length distribution from the data, using the user-provided values as a starting point? Is it even possible to do this for stranded single-end data? (I know it's possible for unstranded single-end.)

@rob-p (Collaborator) commented Aug 31, 2016

Hi Ryan,

The difficulty is, indeed, exactly as you specify. Given a single-end read, one does not know the length of the fragment from which it originates. In this case the "right" thing to do (the best thing we can do) is to consider the read as starting / ending a fragment of every possible length allowed by the user-provided fragment length distribution (with the contribution of each possible fragment weighted by the probability of observing a fragment of that length). To make this computationally feasible, one would have to do some clever pre-computation and think a bit more about how to efficiently update the observed GC model (right now, each mapping contributes a single weight to the model, but under the naive implementation in the single-end case, each mapping would contribute different weights to each bin of the observed GC-bias curve, which would slow things down considerably). Also, as you point out, the quality of the correction would depend somewhat on the user providing appropriate parameters for the fragment length distribution mean and standard deviation, but that seems reasonable in the single-end case. That being said, I'm sure there's a way to handle this efficiently; I'd just have to think about it a bit.
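
Concretely, the naive observed-model update I'm describing would look something like the following (just an illustrative Python sketch, not Salmon's actual C++; the bin count and the FLD representation here are arbitrary):

```python
# Naive single-end update: every candidate fragment length contributes a
# fractional count (its FLD probability) to the GC bin of the window it covers.
import numpy as np

def update_observed_gc(gc_bins, txp_seq, frag_start, fld_pmf):
    """gc_bins: accumulated weights per GC bin; txp_seq: transcript sequence;
    frag_start: 0-based fragment start; fld_pmf: {fragment length: probability}."""
    n_bins = len(gc_bins)
    for frag_len, p in fld_pmf.items():
        end = frag_start + frag_len
        if end > len(txp_seq):
            continue  # this fragment length would run off the transcript
        window = txp_seq[frag_start:end]
        gc_frac = (window.count('G') + window.count('C')) / frag_len
        gc_bins[min(int(gc_frac * n_bins), n_bins - 1)] += p
    return gc_bins

bins = update_observed_gc(np.zeros(25), "ACGT" * 200, 10,
                          {180: 0.3, 200: 0.4, 220: 0.3})
```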

Regarding your second question: Salmon learns the fragment length distribution from paired-end data, but not from single-end data. Single-end data can provide a little bit of information (e.g. there is an upper bound on fragment length that one can infer from how far a single-end read maps from the end of the transcript), but not enough to reliably infer a fragment length distribution.

cc @mikelove in case he has any thoughts on this.

@mikelove (Collaborator) commented:

No additional ideas from me. I didn't cover single-end when coding up alpine, but I would go about it just as you describe.

re: "under the naive implementation in the single-end case", I guess the super-naive implementation is to just use the FLD mean to compute the observed GC model (ignoring the distribution for this estimation task). It might get close enough. One could test this by comparing the GC bias estimates from running paired-end with reads 1 & 2 vs. running single-end with just read 1 or read 2.
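
In code, the super-naive version would be something like this (illustrative Python only, mirroring the sketch above but collapsing the FLD to its mean):

```python
# Super-naive update: use only the FLD mean, so each single-end mapping
# contributes a single count to a single GC bin (illustrative, not Salmon code).
def update_observed_gc_mean_only(gc_bins, txp_seq, frag_start, fld_mean):
    n_bins = len(gc_bins)
    end = frag_start + fld_mean
    if end > len(txp_seq):
        return gc_bins  # assumed fragment would run off the transcript
    window = txp_seq[frag_start:end]
    gc_frac = (window.count('G') + window.count('C')) / fld_mean
    gc_bins[min(int(gc_frac * n_bins), n_bins - 1)] += 1.0
    return gc_bins
```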

@DarwinAwardWinner (Author) commented:

I would be happy with an approach just using the mean fragment length. I assume it would be better than no GC bias correction at all for data that has GC bias issues.

@rob-p (Collaborator) commented Sep 7, 2016

Alright, I'll have a go at the simple model. @mikelove, once I have it implemented we can figure out a reasonable test.

Actually, enabling the feature was way easier than I thought. The actual bias application code (via re-estimation of effective lengths) can remain the same. I now have code paths to build GC bias models that treat each single-end read as arising from a fragment of the conditional mean fragment length (conditioned on the transcript). Let me know what you think would be a good way to test it.
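
For reference, by "conditional mean fragment length" I mean something like the mean of the FLD truncated to the lengths that actually fit on the transcript. A rough sketch (illustrative Python, not the actual implementation):

```python
# Sketch: mean of a normal FLD truncated so no fragment exceeds the transcript
# length (illustrative only; Salmon's internal FLD representation differs).
import numpy as np
from scipy.stats import norm

def conditional_mean_fragment_length(txp_len, fld_mean, fld_sd, min_len=1):
    lengths = np.arange(min_len, txp_len + 1)
    pmf = norm.pdf(lengths, loc=fld_mean, scale=fld_sd)
    total = pmf.sum()
    if total == 0.0:
        return float(txp_len)  # transcript far shorter than the FLD support
    return float((lengths * pmf).sum() / total)

conditional_mean_fragment_length(250, fld_mean=200, fld_sd=25)  # a bit below 200
```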

@DarwinAwardWinner (Author) commented:

I would think the best way to test it would be as Mike says: run Salmon on a paired-end library with bias correction enabled, then re-run it on only read 1 of the same library, on only read 2, and on all of the reads treated as single-end, and see how close it comes in each case. I would say it's also important to test whether the bias estimation is robust to modest misspecification of the mean fragment size, since that is often not known very accurately.
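
Something like this could do the comparison once the runs are done (a sketch; the directory names are placeholders, and I'm assuming the usual tab-separated quant.sf output with a TPM column):

```python
# Compare single-end GC-bias-corrected estimates against the paired-end run by
# correlating log-transformed TPMs from each Salmon output directory.
import numpy as np
import pandas as pd

def load_tpm(quant_dir):
    df = pd.read_csv(f"{quant_dir}/quant.sf", sep="\t", index_col="Name")
    return df["TPM"]

paired = load_tpm("quant_paired_gcbias")  # placeholder output directories
for run in ("quant_read1_gcbias", "quant_read2_gcbias", "quant_all_single_gcbias"):
    single = load_tpm(run).reindex(paired.index)
    r = np.corrcoef(np.log1p(paired), np.log1p(single))[0, 1]
    print(f"{run}: Pearson r on log(TPM + 1) = {r:.3f}")
```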

@DarwinAwardWinner (Author) commented Sep 11, 2016

Unfortunately, I don't necessarily have a great dataset to test on, since my motivation for requesting this feature was that I'm working on a single-end dataset.

@dmr210 commented Apr 5, 2017

Hi,
Sorry for reviving this old thread, but could you confirm that this option was indeed implemented and tested, and let me know which version of Salmon it was implemented in?
Thanks

@alexvpickering (Contributor) commented:

It does seem that this option is implemented in the latest release (0.8.2). With single-end data and the --gcBias flag, there is a warning that the implementation is experimental. Have you had a chance to test the results? I'd be very interested to hear.
