-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Method to keep positions in pileup denominator when no modification probability is reported (i.e. treat probability as zero if unreported) #106
Comments
Hello @OberonDixon You're correct in your observation that Sounds to me like you have a strong prior belief that most of the bases are canonical, so why not model that explicitly? For example, you could set a prior on the fraction of reads that will call a position as modified as If a position has no base modification probabilities then the posterior mean is equal to the prior mean (which is more likely canonical then modified). I appreciate my suggestion doesn't bring the two programs into congruency, however I think it captures the essence of what you're trying to do here. You may want to follow up on why some positions are missing base modification probabilities. There are a couple reasons I can think of:
Hope this helps, happy to answer any additional questions. |
Thank you for the detailed response, this is helpful. I have proven to my satisfaction that the differences between my prior processing code and the modkit implementation is small enough that I can move forward using modkit for now. Regarding the lack of modification calls for some CGs, our pipeline does use reference anchoring (i.e. the base call is always that from the reference, and we just call modifications) but I do not actually know how the basecaller handles cases when the base is very much not the expected canonical base from the reference, I would need to read more documentation. It's possible the modification call may be left empty in that case, even though the expected canonical base is reported by definition. When I was examining the missing C methylation probabilities for CGs they weren't at the beginning or end of reads, and moreover our basecaller should call 5mC in any context. So it remains a bit mysterious to me at the moment, but will not be a high priority currently. I am satisfied that I know how modkit does the processing and I think the assumptions are reasonable. As you say, if I want I can set some prior for bases with no modification call reported. |
I am trying to compare the outputs for modkit v 0.2.2 to some custom bam pileup infrastructure I built previously, before the new modkit arbitrary motif functionality was released. I am getting almost identical results, but with some small differences in valid motif position counts that I believe is caused by a small fraction of megalodon base caller outputs having no reported modification probability (for cytosine or adenine). Ideally I would like to be able to force modkit to recognize bases with no modification call as having probability zero, so I can do a one-to-one comparison with my old pipeline. I might then choose to use modkit in the standard mode, once I can confirm that holding everything constant I get identical results.
Here's what I've been running to see these results:
I am running with a fixed
--mod-thresholds
and accept any base calling quality score, because that is how my older pipeline was set up. I tested on CpG and adenine methylation using the following command structure:My comparison regions is just 10MB in chr1.
When I look at my modified base counts, comparing to my old processing code using a modification call threshold of 230 (i.e. 0.9), I get matching at all 10 million positions for both CpG and A. This is taking the third number from the ninth field in the .bed file, so one for the first row and zero for the second row below.
However, when I look at the first number from the ninth field, i.e. the denominator for the pileup methylation fraction, I get the same result only at about 9,999,900 out of 10,000,000 positions for CpG and 9,999,300 out of 10,000,000 positions for A. Moreover, when I check the actual reads at the given position using pysam.AlignmentFile.fetch() I see that my processing code is matching the number of actual reads whereas, in the 0.001-0.01% of mismatches, modkit is missing a read.
When I look at the
modkit extract
output, I see that there is no reported modification state on one of the two reads for chr1 coordinate 40268863, one of the ~100 mistmatching positions. This is true at others that I check also.This seems to be ultimately because the modbam file itself does not report anything for this CG motif.
I was under the impression that megalodon should report for every single base, given that we specified a reporting threshold of zero when running base and mod calling, but instead it seems to miss a small fraction. Let me know if you think I'm missing something in my examination of this issue, and if there are ways I can get modkit to process the data in a way that allows me to make side-by-side comparisons.
The text was updated successfully, but these errors were encountered: