Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform mean of probability for each read position with modkit extract #163

Open
OceaneMion opened this issue Apr 16, 2024 · 6 comments
Open
Labels
question Looking for clarification on inputs and/or outputs

Comments

@OceaneMion
Copy link

Hi all,
I would like to know if it is possible to write some bash code after performing a modkit extract, to filter based on ref_position and chrom. For example if I have a base that have the same chromosome and ref_position multiple time, let's say contig_1 for chrom associated with ref position 2 appear 4 times, how can I do a new table which will not contain the four line but only one with the mean average of the call_prob ? Also if the call code is different it should not do the mean so it should also verify this criteria. I'm not interested in the other column of the table as I only want to represent the call_prob vs ref_position.
I am kinda new to bioinformatic so any help would be appreciated !
Thanks in advance

@ArtRand
Copy link
Contributor

ArtRand commented Apr 16, 2024

Hello @OceaneMion,

I think what you want is to use modkit pileup, find the documentation online. If you need to measure the correspondence between two reference positions this table is what you want.

Maybe I'm not quite understanding what you need, could you provide a concrete example?

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Apr 16, 2024
@OceaneMion
Copy link
Author

Thank you for your awnser, yes I think pileup is more suitable indeed. But I don't exactly understand how pileup works ? This is an example of the first lines of my output :

contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3
contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3
contig_1 4 5 h 433 . 4 5 255,0,0 433 13.16 57 14 362 1 0 5 11
contig_1 4 5 m 433 . 4 5 255,0,0 433 83.60 362 14 57 1 0 5 11

Is the column 11 : fraction modified, the percentage to have a methylation mark at this position ? Also I don't really understand the link with the mod_qual from modkit extract. I know that modkit extract will give me the probability of a base at a specific position in a read, but sometime I have reads that have different id but same ref position and chromosome so I would like to perform a mean on those.

@OceaneMion
Copy link
Author

I may have another question, is the bedgraph 4th column giving me the probability of the methylation at a specific position it look like this :
contig_1 2 3 0.8321168 411
contig_1 4 5 0.83602774 433
contig_1 47 48 0.5102041 441
contig_1 94 95 0.75 132
And on what is based this probability ? Does it uses all the reads that had the same positions for a specific base to calculate it ?

@ArtRand
Copy link
Contributor

ArtRand commented Apr 16, 2024

Hello @OceaneMion,

During pileup, each read's base modification probability (mod_qual) is converted into a base modification call, i.e. a decision as to the state of the base as modified or canonical. Those calls are then aggregated at each genome position and grouped by the modification code. That's why you'll have a row for h and m, each has the percentage of reads calling this code. The details on the columns are in the documentation. The bedgraph output has the following schema:

column name description type
1 chrom name of reference sequence from BAM header str
2 start position 0-based start position int
3 end position 0-based exclusive end position int
4 fraction modified Nmod / Nvalid_cov float
5 Nvalid_cov the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical int

These values are the same as the corresponding ones in the bedMethyl output.

There are details and examples on how the pass thresholds effect the base modification calls in the documentation as well. I believe what you want is the fraction_modified value.

@OceaneMion
Copy link
Author

Thanks a lot ! So if I want to plot the probability distribution of each modified bases at each genomic position, I will need to use the fraction_modified values right ? or do I need something else?

@ArtRand
Copy link
Contributor

ArtRand commented Apr 18, 2024

Hello @OceaneMion,

Basically, yes. If you have a model that predicts more than one modification at a base (e.g. 5hmC/5mC at cytosine bases), you'll have a categorical distribution where the empirical $p_{\text{mod}}$ is equal to the fraction of modification for that mod and $p_{\text{canonical}} = 1 - \sum_{\text{mod}}p_{\text{mod}}$. So taking your previous example:

contig_1 2 3 h 411 . 2 3 255,0,0 411 12.90 53 16 342 0 0 0 3
contig_1 2 3 m 411 . 2 3 255,0,0 411 83.21 342 16 53 0 0 0 3`

You'd have:

$$ p_{\text{h}} = 0.129 $$

$$ p_{\text{m}} = 0.8321 $$

$$ p_{\text{canonical}} = 1 - (0.8321 + 0.129) = 0.0389 $$

These probabilities define the categorical distribution at that site. There are, of course, a multitude of fancier things you can do, but this is a good place to start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants