Perform mean of probability for each read position with modkit extract #163

OceaneMion · 2024-04-16T09:35:34Z

Hi all,
I would like to know if it is possible to write some bash code after performing a modkit extract, to filter based on ref_position and chrom. For example if I have a base that have the same chromosome and ref_position multiple time, let's say contig_1 for chrom associated with ref position 2 appear 4 times, how can I do a new table which will not contain the four line but only one with the mean average of the call_prob ? Also if the call code is different it should not do the mean so it should also verify this criteria. I'm not interested in the other column of the table as I only want to represent the call_prob vs ref_position.
I am kinda new to bioinformatic so any help would be appreciated !
Thanks in advance

ArtRand · 2024-04-16T13:27:27Z

Hello @OceaneMion,

I think what you want is to use modkit pileup, find the documentation online. If you need to measure the correspondence between two reference positions this table is what you want.

Maybe I'm not quite understanding what you need, could you provide a concrete example?

OceaneMion · 2024-04-16T14:13:08Z

Thank you for your awnser, yes I think pileup is more suitable indeed. But I don't exactly understand how pileup works ? This is an example of the first lines of my output :

contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3
contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3
contig_1 4 5 h 433 . 4 5 255,0,0 433 13.16 57 14 362 1 0 5 11
contig_1 4 5 m 433 . 4 5 255,0,0 433 83.60 362 14 57 1 0 5 11

Is the column 11 : fraction modified, the percentage to have a methylation mark at this position ? Also I don't really understand the link with the mod_qual from modkit extract. I know that modkit extract will give me the probability of a base at a specific position in a read, but sometime I have reads that have different id but same ref position and chromosome so I would like to perform a mean on those.

OceaneMion · 2024-04-16T15:01:56Z

I may have another question, is the bedgraph 4th column giving me the probability of the methylation at a specific position it look like this :
contig_1 2 3 0.8321168 411
contig_1 4 5 0.83602774 433
contig_1 47 48 0.5102041 441
contig_1 94 95 0.75 132
And on what is based this probability ? Does it uses all the reads that had the same positions for a specific base to calculate it ?

ArtRand · 2024-04-16T16:02:42Z

Hello @OceaneMion,

During pileup, each read's base modification probability (mod_qual) is converted into a base modification call, i.e. a decision as to the state of the base as modified or canonical. Those calls are then aggregated at each genome position and grouped by the modification code. That's why you'll have a row for h and m, each has the percentage of reads calling this code. The details on the columns are in the documentation. The bedgraph output has the following schema:

column	name	description	type
1	chrom	name of reference sequence from BAM header	str
2	start position	0-based start position	int
3	end position	0-based exclusive end position	int
4	fraction modified	N_mod / N_{valid_cov}	float
5	N_{valid_cov}	the valid coverage. N_{valid_cov} = N_mod + N_{other_mod} + N_canonical	int

These values are the same as the corresponding ones in the bedMethyl output.

There are details and examples on how the pass thresholds effect the base modification calls in the documentation as well. I believe what you want is the fraction_modified value.

OceaneMion · 2024-04-18T09:05:09Z

Thanks a lot ! So if I want to plot the probability distribution of each modified bases at each genomic position, I will need to use the fraction_modified values right ? or do I need something else?

ArtRand · 2024-04-18T15:44:40Z

Hello @OceaneMion,

Basically, yes. If you have a model that predicts more than one modification at a base (e.g. 5hmC/5mC at cytosine bases), you'll have a categorical distribution where the empirical $p_{\text{mod}}$ is equal to the fraction of modification for that mod and $p_{\text{canonical}} = 1 - \sum_{\text{mod}}p_{\text{mod}}$. So taking your previous example:

contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3
contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3`

You'd have:

$$ p_{\text{h}} = 0.129 $$

$$ p_{\text{m}} = 0.8321 $$

$$ p_{\text{canonical}} = 1 - (0.8321 + 0.129) = 0.0389 $$

These probabilities define the categorical distribution at that site. There are, of course, a multitude of fancier things you can do, but this is a good place to start.

ArtRand added the question Looking for clarification on inputs and/or outputs label Apr 16, 2024

ArtRand mentioned this issue Apr 18, 2024

Detect methylation data background noise #147

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perform mean of probability for each read position with modkit extract #163

Perform mean of probability for each read position with modkit extract #163

OceaneMion commented Apr 16, 2024

ArtRand commented Apr 16, 2024

OceaneMion commented Apr 16, 2024

OceaneMion commented Apr 16, 2024

ArtRand commented Apr 16, 2024 •

edited

Loading

OceaneMion commented Apr 18, 2024

ArtRand commented Apr 18, 2024

Perform mean of probability for each read position with modkit extract #163

Perform mean of probability for each read position with modkit extract #163

Comments

OceaneMion commented Apr 16, 2024

ArtRand commented Apr 16, 2024

OceaneMion commented Apr 16, 2024

OceaneMion commented Apr 16, 2024

ArtRand commented Apr 16, 2024 • edited Loading

OceaneMion commented Apr 18, 2024

ArtRand commented Apr 18, 2024

ArtRand commented Apr 16, 2024 •

edited

Loading