"--gcBias" option for single-end reads? #83

Open

DarwinAwardWinner opened this issue Aug 31, 2016 · 9 comments

@DarwinAwardWinner

Is there any possibility of implementing the --gcBias option for single-end reads? I have a dataset where this would be very useful. I suppose that in order to do this you'd need to assign an end point to each fragment whose start point is represented by a single read. You could either just assume every fragment's length is equal to the given mean fragment length, or maybe use the fragment length distribution implied by the mean and SD to compute, for each nucleotide past the end of the read, the probability that the fragment extends to that nucleotide, and then use that probability as a weight. Obviously this would require warning the user to set the fragment length mean and SD parameters to values appropriate for their data.
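
For concreteness, here's a rough sketch of the per-nucleotide weighting I have in mind (illustrative Python with NumPy/SciPy, not Salmon code; the function name and parameter values are just placeholders):

```python
# Sketch: weight each position past the read's 3' end by the probability that
# the fragment extends at least that far, under a normal FLD with the
# user-supplied mean and SD, truncated to lengths >= the read length.
import numpy as np
from scipy.stats import norm

def extension_weights(read_len, fld_mean, fld_sd, max_len=1000):
    """Return {d: P(fragment length >= d)} for d = read_len .. max_len."""
    lengths = np.arange(read_len, max_len + 1)
    pmf = norm.pdf(lengths, loc=fld_mean, scale=fld_sd)
    pmf /= pmf.sum()                      # normalize over the allowed range
    survival = pmf[::-1].cumsum()[::-1]   # tail sums: P(L >= d)
    return dict(zip(lengths, survival))

w = extension_weights(read_len=50, fld_mean=200, fld_sd=25)
# w[200] is roughly 0.5: about half of the fragments extend 200 bp or more
```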

@DarwinAwardWinner (Author) commented Aug 31, 2016

Side note: does Salmon learn the fragment length distribution from the data, using the user-provided values as a starting point? Is it even possible to do this for stranded single-end data? (I know it's possible for unstranded single-end.)

@rob-p (Collaborator) commented Aug 31, 2016

Hi Ryan,

The difficulty is, indeed, exactly as you specify. Given a single-end read, one does not know the length of the fragment from which it originates. In this case the "right" thing to do (the best thing we can do) is to consider the read as starting / ending a fragment of every possible length allowed by the user-provided fragment length distribution (with the contribution of each possible fragment weighted by the probability of observing a fragment of that length). To make this computationally feasible, one would have to do some clever pre-computation and think a bit more about how to efficiently update the observed GC model (right now, each mapping contributes a single weight to the model, but under the naive implementation in the single-end case, each mapping would contribute different weights to each bin of the observed GC-bias curve, which would slow things down considerably). Also, as you point out, the quality of the correction would depend somewhat on the user providing appropriate parameters for the fragment length distribution mean and standard deviation, but that seems reasonable in the single-end case. That being said, I'm sure there's a way to handle this efficiently; I'd just have to think about it a bit.
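
Concretely, the naive observed-model update I'm describing would look something like the following (just an illustrative Python sketch, not Salmon's actual C++; the bin count and the FLD representation here are arbitrary):

```python
# Naive single-end update: every candidate fragment length contributes a
# fractional count (its FLD probability) to the GC bin of the window it covers.
import numpy as np

def update_observed_gc(gc_bins, txp_seq, frag_start, fld_pmf):
    """gc_bins: accumulated weights per GC bin; txp_seq: transcript sequence;
    frag_start: 0-based fragment start; fld_pmf: {fragment length: probability}."""
    n_bins = len(gc_bins)
    for frag_len, p in fld_pmf.items():
        end = frag_start + frag_len
        if end > len(txp_seq):
            continue  # this fragment length would run off the transcript
        window = txp_seq[frag_start:end]
        gc_frac = (window.count('G') + window.count('C')) / frag_len
        gc_bins[min(int(gc_frac * n_bins), n_bins - 1)] += p
    return gc_bins

bins = update_observed_gc(np.zeros(25), "ACGT" * 200, 10,
                          {180: 0.3, 200: 0.4, 220: 0.3})
```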

Regarding your second question: Salmon learns the fragment length distribution from paired-end data, but not from single-end data. Single-end data can provide a little bit of information (e.g. there is an upper bound on fragment length that one can infer from how far a single-end read maps from the end of the transcript), but not enough to reliably infer a fragment length distribution.

cc @mikelove in case he has any thoughts on this.

@mikelove (Collaborator) commented:

No additional ideas from me. I didn't cover single-end when coding up alpine, but I would go about it just as you describe.

re: "under the naive implementation in the single-end case", I guess the super-naive implementation is to just use the FLD mean to compute the observed GC model (ignoring the distribution for this estimation task). It might get close enough. One could test this by comparing the GC bias estimates from running paired-end with reads 1 & 2 vs. running single-end with just read 1 or read 2.
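
In code, the super-naive version would be something like this (illustrative Python only, mirroring the sketch above but collapsing the FLD to its mean):

```python
# Super-naive update: use only the FLD mean, so each single-end mapping
# contributes a single count to a single GC bin (illustrative, not Salmon code).
def update_observed_gc_mean_only(gc_bins, txp_seq, frag_start, fld_mean):
    n_bins = len(gc_bins)
    end = frag_start + fld_mean
    if end > len(txp_seq):
        return gc_bins  # assumed fragment would run off the transcript
    window = txp_seq[frag_start:end]
    gc_frac = (window.count('G') + window.count('C')) / fld_mean
    gc_bins[min(int(gc_frac * n_bins), n_bins - 1)] += 1.0
    return gc_bins
```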

@DarwinAwardWinner (Author) commented:

I would be happy with an approach just using the mean fragment length. I assume it would be better than no GC bias correction at all for data that has GC bias issues.

@rob-p (Collaborator) commented Sep 7, 2016

Alright, I'll have a go at the simple model. @mikelove, once I have it implemented we can figure out a reasonable test.

Actually, enabling the feature was way easier than I thought. The actual bias application code (via re-estimation of effective lengths) can remain the same. I now have code paths to build GC bias models that treat each single-end read as arising from a fragment of the conditional mean fragment length (conditioned on the transcript). Let me know what you think would be a good way to test it.
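
For reference, by "conditional mean fragment length" I mean something like the mean of the FLD truncated to the lengths that actually fit on the transcript. A rough sketch (illustrative Python, not the actual implementation):

```python
# Sketch: mean of a normal FLD truncated so no fragment exceeds the transcript
# length (illustrative only; Salmon's internal FLD representation differs).
import numpy as np
from scipy.stats import norm

def conditional_mean_fragment_length(txp_len, fld_mean, fld_sd, min_len=1):
    lengths = np.arange(min_len, txp_len + 1)
    pmf = norm.pdf(lengths, loc=fld_mean, scale=fld_sd)
    total = pmf.sum()
    if total == 0.0:
        return float(txp_len)  # transcript far shorter than the FLD support
    return float((lengths * pmf).sum() / total)

conditional_mean_fragment_length(250, fld_mean=200, fld_sd=25)  # a bit below 200
```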

@DarwinAwardWinner (Author) commented:

I would think the best way to test it would be as Mike says: run Salmon on a paired-end library with bias correction enabled, then re-run it on only read 1 of the same library, on only read 2, and on all of the reads treated as single-end, and see how close it comes in each case. I would say it's also important to test whether the bias estimation is robust to modest misspecification of the mean fragment size, since that is often not known very accurately.
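
Something like this could do the comparison once the runs are done (a sketch; the directory names are placeholders, and I'm assuming the usual tab-separated quant.sf output with a TPM column):

```python
# Compare single-end GC-bias-corrected estimates against the paired-end run by
# correlating log-transformed TPMs from each Salmon output directory.
import numpy as np
import pandas as pd

def load_tpm(quant_dir):
    df = pd.read_csv(f"{quant_dir}/quant.sf", sep="\t", index_col="Name")
    return df["TPM"]

paired = load_tpm("quant_paired_gcbias")  # placeholder output directories
for run in ("quant_read1_gcbias", "quant_read2_gcbias", "quant_all_single_gcbias"):
    single = load_tpm(run).reindex(paired.index)
    r = np.corrcoef(np.log1p(paired), np.log1p(single))[0, 1]
    print(f"{run}: Pearson r on log(TPM + 1) = {r:.3f}")
```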

@DarwinAwardWinner (Author) commented Sep 11, 2016

Unfortunately, I don't necessarily have a great dataset to test on, since my motivation for requesting this feature was that I'm working on a single-end dataset.

@dmr210 commented Apr 5, 2017

Hi,
Sorry for reviving this old thread, but could you confirm that this option was indeed implemented and tested, and let me know which version of Salmon it was implemented in?
Thanks

@alexvpickering (Contributor) commented:

It does seem that this option is implemented in the latest release (0.8.2). With single-end data and the --gcBias flag, there is a warning that the implementation is experimental. Have you had a chance to test the results? I'd be very interested to hear.
