Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method to keep positions in pileup denominator when no modification probability is reported (i.e. treat probability as zero if unreported) #106

Closed
OberonDixon opened this issue Jan 2, 2024 · 2 comments

Comments

@OberonDixon
Copy link

I am trying to compare the outputs for modkit v 0.2.2 to some custom bam pileup infrastructure I built previously, before the new modkit arbitrary motif functionality was released. I am getting almost identical results, but with some small differences in valid motif position counts that I believe is caused by a small fraction of megalodon base caller outputs having no reported modification probability (for cytosine or adenine). Ideally I would like to be able to force modkit to recognize bases with no modification call as having probability zero, so I can do a one-to-one comparison with my old pipeline. I might then choose to use modkit in the standard mode, once I can confirm that holding everything constant I get identical results.

Here's what I've been running to see these results:

I am running with a fixed --mod-thresholds and accept any base calling quality score, because that is how my older pipeline was set up. I tested on CpG and adenine methylation using the following command structure:

#modkit pileup command attempting to replicate my processing results
modkit pileup megalodon_updated_tags.bam output.bed \
--include-bed comparison_region.bed \
--motif CG 0 --motif A 0 \
--ref chm13.draft_v1.1.fasta \
--log-filepath /pileup.log \
--filter-threshold 0 --mod-thresholds m:0.9 --mod-thresholds a:0.9

My comparison regions is just 10MB in chr1.

chr1	40000000	50000000	.	.	.

When I look at my modified base counts, comparing to my old processing code using a modification call threshold of 230 (i.e. 0.9), I get matching at all 10 million positions for both CpG and A. This is taking the third number from the ninth field in the .bed file, so one for the first row and zero for the second row below.

#lists loaded from bedmethyl output file
['chr1', '40000627', '40000628', 'm,CG,0', '1', '+', '40000627', '40000628', '255,0,0', '1 100.00 1 0 0 0 0 0 0\n']
['chr1', '40000628', '40000629', 'm,CG,0', '1', '-', '40000628', '40000629', '255,0,0', '1 0.00 0 1 0 0 0 0 0\n']

However, when I look at the first number from the ninth field, i.e. the denominator for the pileup methylation fraction, I get the same result only at about 9,999,900 out of 10,000,000 positions for CpG and 9,999,300 out of 10,000,000 positions for A. Moreover, when I check the actual reads at the given position using pysam.AlignmentFile.fetch() I see that my processing code is matching the number of actual reads whereas, in the 0.001-0.01% of mismatches, modkit is missing a read.

base = genome_fasta.fetch('chr1',start_co+A_coord_rel,start_co+A_coord_rel+1)
read_counter = 0
for read in modbam.fetch('chr1',start_co+A_coord_rel,start_co+A_coord_rel+1):        
    if base=='A' and read.is_forward or base=='T' and read.is_reverse:
        read_counter+=1

When I look at the modkit extract output, I see that there is no reported modification state on one of the two reads for chr1 coordinate 40268863, one of the ~100 mistmatching positions. This is true at others that I check also.

read_id	forward_read_position	ref_position	chrom	mod_strand	ref_strand	ref_mod_strand	fw_soft_clipped_start	fw_soft_clipped_end	read_length	mod_qual	mod_code	base_qual	ref_kmer	query_kmer	canonical_base	modified_primary_base	inferred
54ad7ffa-3752-4e74-a5bd-eade87757141	19	40268860	chr1	+	-	-	0	0	13662	0.001953125	a	255	.	GTAGT	A	A	false
#chr1 40268863 has a CpG on this read - which is correct
54ad7ffa-3752-4e74-a5bd-eade87757141	16	40268863	chr1	+	-	-	0	0	13662	0.001953125	m	255	.	CACGT	C	C	false
54ad7ffa-3752-4e74-a5bd-eade87757141	15	40268864	chr1	+	-	-	0	0	13662	0.001953125	a	255	.	ACACG	A	A	false
e7491c1a-9df8-4b0b-b8a2-16828976e1f2	5366	40268855	chr1	+	-	-	0	0	18185	0.001953125	m	255	.	GGCAC	C	C	false
e7491c1a-9df8-4b0b-b8a2-16828976e1f2	5361	40268860	chr1	+	-	-	0	0	18185	0.001953125	a	255	.	GTAGT	A	A	false
#somehow there is no CpG on this second read at chr1 40268863, according to modkit extract
e7491c1a-9df8-4b0b-b8a2-16828976e1f2	5340	40268881	chr1	+	-	-	0	0	18185	0.001953125	a	255	.	AAAAA	A	A	false

This seems to be ultimately because the modbam file itself does not report anything for this CG motif.

modbam = pysam.AlignmentFile('file.bam')
for read in modbam.fetch('chr1',40268863,40268864):
    if read.query_name == 'e7491c1a-9df8-4b0b-b8a2-16828976e1f2':
        print(read.modified_bases['C',1,'Z'][1243:1247])
        print(read.get_reference_positions()[read.query_length-5358-1])
        print(read.get_forward_sequence()[5358:5360])

[(5351, 0), (5354, 2), (5362, 0), (5365, 0)]
40268863
CG

I was under the impression that megalodon should report for every single base, given that we specified a reporting threshold of zero when running base and mod calling, but instead it seems to miss a small fraction. Let me know if you think I'm missing something in my examination of this issue, and if there are ways I can get modkit to process the data in a way that allows me to make side-by-side comparisons.

@ArtRand
Copy link
Contributor

ArtRand commented Jan 3, 2024

Hello @OberonDixon

You're correct in your observation that modkit will not assume the "absence of evidence is evidence of absence". So if a position doesn't have any base modification calls (using the ? mode, see page 7) it will not fall back to considering the base canonical. You're use of --mod-threshold and --filter-threshold looks correct if you want to cause all but the most confidently called modified bases to be called as canonical.

Sounds to me like you have a strong prior belief that most of the bases are canonical, so why not model that explicitly? For example, you could set a prior on the fraction of reads that will call a position as modified as $\text{Beta}(\alpha, \beta) = \text{Beta}(0.1, 0.5)$, such that the prior mean is 16.6% (of course you can adjust this as you see fit). You can use the modkit pileup output to generate the posterior distribution

$$\mu_{\text{posterior}} = \frac{\alpha + N_{\text{mod}}}{\alpha + N_{\text{mod}} + \beta + N_{\text{canonical}}}$$

If a position has no base modification probabilities then the posterior mean is equal to the prior mean (which is more likely canonical then modified).

I appreciate my suggestion doesn't bring the two programs into congruency, however I think it captures the essence of what you're trying to do here.

You may want to follow up on why some positions are missing base modification probabilities. There are a couple reasons I can think of:

  1. The read has an mismatch at the potentially modified base, e.g. an A->G mismatch will not contain a 6mA probability. I don't know if you are using "reference anchored" calls or normal basecall-anchored ones though.
  2. Some CGs could be truncated at the ends of reads, e.g. CG in the reference is C-END in the read. I doubt this happens enough to account for all the missing calls, however.

Hope this helps, happy to answer any additional questions.

@OberonDixon
Copy link
Author

Thank you for the detailed response, this is helpful.

I have proven to my satisfaction that the differences between my prior processing code and the modkit implementation is small enough that I can move forward using modkit for now.

Regarding the lack of modification calls for some CGs, our pipeline does use reference anchoring (i.e. the base call is always that from the reference, and we just call modifications) but I do not actually know how the basecaller handles cases when the base is very much not the expected canonical base from the reference, I would need to read more documentation. It's possible the modification call may be left empty in that case, even though the expected canonical base is reported by definition.

When I was examining the missing C methylation probabilities for CGs they weren't at the beginning or end of reads, and moreover our basecaller should call 5mC in any context.

So it remains a bit mysterious to me at the moment, but will not be a high priority currently. I am satisfied that I know how modkit does the processing and I think the assumptions are reasonable. As you say, if I want I can set some prior for bases with no modification call reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants