Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarifications on DMR Analysis #191

Open
Proy321 opened this issue May 27, 2024 · 14 comments
Open

Clarifications on DMR Analysis #191

Proy321 opened this issue May 27, 2024 · 14 comments
Labels
DMR modkit dmr question Looking for clarification on inputs and/or outputs

Comments

@Proy321
Copy link

Proy321 commented May 27, 2024

Hello @ArtRand
I would like to confirm that the scores obtained in the DMR are indeed calculated based on the last two columns, i.e., pct_a_samples and pct_b_samples.
Specifically, I would also like to understand the roles of the map-p value and effect size in this process. Could you please elaborate on the significance of these parameters and advise on whether we should prioritize the score, map-p value, or effect size when filtering out the most significant DMR regions?
Also. i want to specifically identify dmr in 5hmC, so during pileup, should I include a flag with "--ignore C" to focus solely on identifying DMR genes associated with 5hmc?
Your insights and guidance on these matters would be greatly appreciated as i strive to conduct a thorough and accurate analysis of our DMR data.
Looking forward to your response.

Thanks

@Proy321
Copy link
Author

Proy321 commented May 27, 2024

Also, is there any threshold involved in map-pvalue or effect size to be considered for the significant dmr

@ArtRand
Copy link
Contributor

ArtRand commented May 28, 2024

Hello @Proy321,

The score and MAP-based p-value are calculated using the sample a and sample b counts (columns 6 and 8, respectively). In the case of the MAP-based p-value, counts of all modifications are combined as described in the limitations (number 4). The columns for sample a percent modified and sample b percent modified (columns 12 and 13, respectively) are not used directly in the calculation and are for reporting purposed. The columns you are referring to, pct_a_samples and pct_b_samples (columns 18 and 19) only report the fraction of the samples used in the test. For example if you have 2 replicates for each condition (4 inputs total), but only 1 of the two samples has enough coverage at a site, this number will be 50% (1 in 2 of the samples are used). You can review issue-149 for some additional details.

Specifically, I would also like to understand the roles of the map-p value and effect size in this process. Could you please elaborate on the significance of these parameters and advise on whether we should prioritize the score, map-p value, or effect size when filtering out the most significant DMR regions?

If you are looking for generally differentially modified sites, I would use the MAP-based p-value. You can decide on a threshold depending on your experiment. I would sanity check your results with the effect size. If you want to discover changes in methylation accounting for changes in type (i.e. 5mC vs 5hmC) I would use the score. The score is quite good at ranking regions and sites by how different they are. The details for the calculations is in the documentation. If you want to find differentially methylated regions while accounting for different modification types, I'd recommend using the --segment option.

Also. i want to specifically identify dmr in 5hmC, so during pileup, should I include a flag with "--ignore C" to focus solely on identifying DMR genes associated with 5hmc?

No, --ignore C will not do what you're looking for here. There isn't a direct way to filter to regions that are differentially hydroxymethylated. What you have to do is run modkit dmr pair either with or without --segment then calculate the change in 5hmC levels by parsing column 11 and 12. Records with large changes would presumably be due to changes in 5hmC levels. There is some polars code in a previous issue that may be useful. I have some ideas of how to make this kind of filtering easier that may come out in a future release.

@ArtRand ArtRand added question Looking for clarification on inputs and/or outputs DMR modkit dmr labels May 28, 2024
@Proy321
Copy link
Author

Proy321 commented May 29, 2024

Thanks for your response.
It was really helpful to understand the DMR results.
I have a follow-up question @ArtRand regarding the --segment option
Is there a way to incorporate multiple samples in the -a and -b options in --segment option .The tsv file generated for the 5hmC (ATTACHED HERE) has combined 5hmC<->5mC calls.
image
and I am only interested to know the dmr of 5hmC.
Additionally, could you advise on which BED file I should use in the --segment option? Since I am analyzing multiple samples for(5hmC<->5mC calls) so that I can record significant changes in 5hmC levels across these samples only.
Further, the other question I have is I have set a threshold value of less than 0.05 in case of map-p value, but I can see that the effect size has some negative values reflecting while putting the threshold for map-p value, should that be considered or only the positive values should be considered in the effect size.
It would be nice to hear your comments on the same.

Thanks&Regards
Priyanka Roy

@ArtRand
Copy link
Contributor

ArtRand commented May 29, 2024

Hello @Proy321,

Is there a way to incorporate multiple samples in the -a and -b options in --segment option .

Yes, as documented, use can pass -a and -b multiple times to include multiple samples.

Currently there isn't a way to hone into 5mC<>5hmC differential methylation, but I'm planning on adding this in a future release. As I mentioned in the previous comment, the easiest way to find sites where you're seeing changes in 5hmC will be to parse the a_mod_percentages and b_mod_percentages columns. The score does incorporate 5mC<>5hmC changes, so you can look at high scoring sites or regions.

Additionally, could you advise on which BED file I should use in the --segment option?

You don't need to provide a BED file, a BED file will be produced with the segment labels. You should analyze this file the same way as the single-site one, find the "different" regions that have large changes in 5hmC.

Further, the other question I have is I have set a threshold value of less than 0.05 in case of map-p value, but I can see that the effect size has some negative values reflecting while putting the threshold for map-p value, should that be considered or only the positive values should be considered in the effect size.

A negative effect size only means that the change is a reduction in modification levels with respect to the a samples.

@PRIYANKA-22091995
Copy link

Thank you so much for your response.
@ArtRand , So, when we are reporting significant DMR regions, with a threshold of map-p value of less than 0.05, so does it make sense to report the positive values in the effect size?
Note- As my point of interest is only those DMR regions where we are seeing a significant change in the b samples(treatment samples) compared to the a samples(control samples).

Thanks

@Proy321
Copy link
Author

Proy321 commented May 30, 2024

Hello @ArtRand
Example to the above query, here is the given example, what will the interpretation for this example with respect to the effect size.
image

Thanks

@Proy321
Copy link
Author

Proy321 commented Jun 1, 2024

Hello @ArtRand
It would be nice to have your inputs on the above queries.
In addition, regarding the output received after executing the '--segment' command. Specifically, I noticed that the column 'state-name' consistently displays 'same' entries, while the 'score' column indicates high scores.
Given this observation, I am seeking guidance on how to interpret these results. Even though the 'state-name' column remains constant with 'same' entries, could the high scores in the 'score' column still indicate significant changes resulting from variations in 5hmC levels(attached screenshot)
image

I believe understanding the relationship between the 'state-name' and 'score' columns is crucial for accurately interpreting the results of my analysis.

It would be nice to have your inputs for the same.

Thanks

@PRIYANKA-22091995
Copy link

PRIYANKA-22091995 commented Jun 3, 2024

Hello @ArtRand
It would be nice to have your inputs on the above queries.

Thanks
Priyanka Roy

@ArtRand
Copy link
Contributor

ArtRand commented Jun 3, 2024

Hello @Proy321,

To answer your first question regarding the interpretation of the DMR output. Both positions (8441 and 8443) are the same. In sample a there are 3 reads at the position all of them indicate a canonical cytosine. So the 5hmC and 5mC modification percentages are 0%. Similarly, the percent modified (any modification type) is 0%. For the b sample there are 2 reads, one has a 5hmC call and one has a 5mC call. Therefore the 5mC and 5hmC percentages are 50% each and the modification percent is 100% (2 of 2).

The effect size is then the percent modified in a minus the percent modified in b, 0 - 1 = -1.

The MAP-based p-value is below the often used 0.05 number, however in my experiments at least, I don't use any positions with at least 5 reads of valid coverage for both samples.

To your second question, the score is, unfortunately, somewhat correlated with the number of potentially modified positions (CpGs in this case I believe) in the region. So large regions of "same" will have higher scores. All of the effect sizes in the table you've shown are very small, so I would not interpret a high score as evidence of any substantial change in methylation. If you want to find regions where there is evidence for differential 5hmC, I would filter to just the "different" records and look for large changes in percent 5hmC (which should also have high score).

In the next release the "score" column in the DMR output will be replaced with a probability of being in that state and there will be a "between modification" effect size that will make this analysis a little easier.

@PRIYANKA-22091995
Copy link

Thank you so much for your response @ArtRand
I have a follow-up question, Suppose, for example, I have the following data:

Column 'a' with 5 reads showing 0% modification
Column 'b' with 7 reads showing 100% modification
Given this data, the effect size would be -1. My query is whether this scenario would still be considered a differentially methylated region (DMR). Specifically, in this case, would the -1 effect size indicate that the 'b' column is differentially methylated compared to the 'a' column?

Thank you for your assistance and clarification on this matter.

Thanks

@ArtRand
Copy link
Contributor

ArtRand commented Jun 4, 2024

Hello @PRIYANKA-22091995,

If you have a site with 0 of 5 reads reporting modification (all 5 report canonical) in one condition and 7 of 7 reporting methylation (of any type) in another condition. The MAP-based p-value will be 0.00024, meaning the (posterior) probability of zero effect divided by the observed effect (abs(-1)) is 0.00024 (see the docs for a graphical explanation). As far as deciding whether this is a differentially methylated position or not, you have to consider your entire experiment to make that call. If you consistently see this difference in multiple replicates of the two conditions (and not within conditions) you can be more confident in the observed effect.

The next release will have a simple function that will calculate the MAP-based p-value for a given number of modified/canonical reads so you can explore what posterior probabilities look like.

@Proy321
Copy link
Author

Proy321 commented Jun 19, 2024

Thank you for your response.
I have a followup question on DMR @ArtRand , on the basis of result obtained on DMR , is there a way to decide hyper and hypo methylated regions based on any particular threshold or cutoff.
Looking forward to your reply.

Thanks

@Manoswini-02
Copy link

@ArtRand , @PRIYANKA-22091995

I'm also looking for 5hmC-associated-DMRs. In this regard, the modbam2bed tool offers an option (-m) to target specific modified base types. For eg, if my focus is on detecting 5hmC-related DMRs, can I reliably use the following code as recommended in the ONT community documentation (https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-sequencing-gdna-rrms/v/rrms_9164_v110_revd_30may2022/downstream-analysis?devices=gridion#faq-tab=). It extracts 5hmC data (or any modified base), which can then be analyzed further using tools such as DSS or similar packages.

Below is the code:

modbam2bed -m 5hmC -e --cpg -t 8 -f 0.66 \
--aggregate --prefix 5hmC_P2 \
/mnt/d/ONT_modBam/Demultiplex1/hg38.fa \
/mnt/d/ONT_modBam/Demultiplex1/P2_BAM/SQK-NBD114-24_barcode01.bam > P2_5hmC.bed

While the code mentioned is intended for processing RRMS data, it seems adaptable for other forms of Nanopore data. I understand the differences between using modbam2bed and modkit's pile-up function, as highlighted in discussions (issue #14 and #2). However, working with the 5hmC data appears straightforward using this, assuming accuracy. I compared both bed file outputs and found that most columns contain identical values, except for the read coverage from modbam2bed and Nvalidcov from modkit. This difference is expected since each of them have different calculation method.

My question is whether it's possible to exclusively extract a modified base of interest? If not, is it feasible to incorporate this functionality into modkit dmr or during the creation of the bedmethyl file using pileup?

Thanks!

@ArtRand
Copy link
Contributor

ArtRand commented Jul 11, 2024

Hello @Manoswini-02,

If you only want bedmethyl rows with 5hmC records, you can filter the output through awk:

$ modkit pileup ${mod_bam} stdout [options] | awk '$4=="h"' > 5hmC.bedmethyl

However, if you do this the output will not be suitable for modkit dmr (in fact it should throw an error). This is because if you have a modified base model that predicts 5mC/5hmC/C, you have to model all 3 together - which modkit dmr does and so it needs all of the records. Most DMR packages such as DSS are 2-way comparison models (modified/not-modified, usually 5mC/C). I'm currently working on some metrics that will help find sites where one of N modifications have significant changes - but it's not ready yet. Right now the best I can offer is to use the modkit dmr pair function (wither with --regions or without), parse the output (the annoying part), and find records with large changes in the 5hmC fraction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DMR modkit dmr question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

4 participants