Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large q-values #19

Closed
RoseString opened this issue Jan 21, 2019 · 6 comments
Closed

Large q-values #19

RoseString opened this issue Jan 21, 2019 · 6 comments

Comments

@RoseString
Copy link

Dear dmrseq developers,

Thank you for such a great program! My experience with dmrseq has been very good. There are two biological conditions in my study which differ by a region in a chromosome. Previous differential-expression analysis has established that the vast majority of differentially expressed genes between the two conditions are located in that region (~100-200 DEGs by DESeq2 using FDR < 0.05). I have compared results from dmrseq to results from other software using WGBS data of ~8 samples per group (sequencing depth > 10), and dmrseq outperforms all of them. That is, most top DMRs identified by dmrseq are from that region, which is consistent with the DEG analysis. On the contrary, other software gave me a lot of false positives.

However, I'm having a problem with large q-values in my results. When I used q-value criteria, there were no significant results under FDR < 0.05 and very few under 0.1 (or even 0.2), even though many top DMRs had very low p-values (1e-7 ~ 1e-5). For example, below is the top one DMR in my list (20 permutations, cutoff = 0.05, and the default settings for other parameters). The FDR is 0.08 , but the p-value is 6-06 (area = 6.44; beta = 0.87). Increasing the number of permutations to 50 did not improve the situation.

screen shot 2019-01-21 at 11 49 41 am

I read the dmrseq paper, and I think I might understand why q-values are relatively large. If I understand correctly, dmrseq first picks "candidate regions" that have smoothed methylation differences greater than a cutoff (by the default setting, it is 10%; and in my case, it is 5%). Then a test statistic is calculated per candidate, and an empirical p-value is computed by comparing to the null distribution generated by permutations. The p-values are then adjusted by FDR. My concern is that regions with relatively big effect sizes are pre-selected, and FDR is performed only on these regions. This seems analogous to selecting genes with big expression differences and computing FDRs on p-values of these genes instead on p-values of all genes, which would result in larger q-values. Below are the distributions of p-values and q-values of all regions reported by dmrseq.

image

I have tried using a lower cutoff (like 1%) to generate more regions with smaller effect sizes. This indeed gave me a few dmrs with FDR < 0.05, but also increased the number of false positives. So I'm not sure this is the right thing to do. My current solution is to use p-values cutoffs (such as 0.001) to define DMRs, but I don't think reviewers would like that.

I'm not a statistician, so I may have made some mistakes when describing this issue. Hopefully, it makes some sense? If you know some parameters that I can change to improve DMR identifications with q-values, that would be great. Thank you very much in advance!

@kdkorthauer
Copy link
Owner

Hi @RoseString,

Thanks for reaching out. You are correct in that the dmrseq procedure selects regions with large effect sizes. However, when evaluating FDR it also does so for random permutations of the data. So the most promising regions observed in the data are compared to the most promising null regions to evaluate their significance and calculate valid p-values.

Your concern is that you expect more regions with smaller FDR - looking at the distribution of your p-values, I would not expect many highly significant regions (based on a lack of a large spike of low p-values). However, the 'hill' shape in the center is unexpected. Are there any sample covariates that it might make sense to adjust for?

Another thing I'll note is that what can sometimes happen is that too few null candidate regions are found, which reduces the resolution of the FDR estimate (making it too conservative). In that case, I'd recommend some of the things you have tried: decreasing the cutoff threshold and increasing the number of permutations. I'd recommend trying a cutoff threshold of something like 2.5% instead of 5%. When you say you observed more false positives, how was this determined? The FDR should still be controlled with lower cutoff values.

In addition, are you restricting to the region in the chromosome you describe in the first paragraph? Or are you analyzing the entire genome using all of the WGBS data? dmrseq was designed for the latter, so it will require some adaptation to apply to only regions of interest.

Best,
Keegan

@RoseString
Copy link
Author

Hi @kdkorthauer,

Thank you very much for your prompt reply! I'm now understanding your approach to compute q-values better.

I am actually using the whole genome to analyze all the WGBS data instead of just using that region (10% of the genome), so this should not be a problem. However, the genome assembly is at the scaffold level (thousands of scaffolds). For the dmrseq run, I treated each scaffold as a chromosome during the permutation step. I'm not sure if this could affect results?

As for the p-value distribution, my previous figure is probably misleading. I have redrawn the distribution using a larger "break" value in R. The spike of low p-values (p < 0.001, around 60 DMRs) seems clearer.
image

The age of samples (immature vs. adults) has already been adjusted by matchCovariate, and this is the only covariate that we are aware of for these samples. As the animals were captured in the wild, we don't really have the accurate age of samples (i.e. how many days since birth). Maybe this could contribute to the hill-shaped distribution of p-values?

I will certainly try other cutoff values such as 2.5% and more permutations. Thanks again!!

@RoseString
Copy link
Author

Dear @kdkorthauer ,

I have a question regarding a sentence in your comment:

In addition, are you restricting to the region in the chromosome you describe in the first paragraph? Or are you analyzing the entire genome using all of the WGBS data? dmrseq was designed for the latter, so it will require some adaptation to apply to only regions of interest.

I'm actually starting to look at allele-specific methylation which is limited to the specific region. What is "some adaptation" that I could do?

Thanks!

@kdkorthauer
Copy link
Owner

Hi @RoseString,

My comment is regarding the fact that the default parameters for dmrseq were selected with WGBS analysis in mind. If instead you are analyzing targeted data, such as RRBS, or restricting WGBS data to a collection of regions of interest, then it may be necessary to alter these parameters. I haven't explored this area enough to have concrete recommendations, but see issue #14 for more detail/discussion.

When you say you are limited to "the specific region", are you really only looking at one region? If so, then dmrseq is not applicable. It is designed to detect regions of differential methylation out of a large number of possible regions.

Very interesting to look at allele-specific methylation! I'm curious how you are hoping to use dmrseq to examine allele-specific methylation. What do you use to obtain allele-specific methylation counts? What is the differential comparison you are interested in?

Best,
Keegan

@RoseString
Copy link
Author

Hi @kdkorthauer ,

Thanks for your suggestions! The region that I was talking about is actually ~100Mb long, which differs between a pair of chromosomes due to lack of recombination. We used population WGS data to identify fixed differences between the two alleles (1% genomic differences), and used WGBS data to obtain methylation values for each one (SNPsplit + Bismark).

At the whole-region level (majority is intergenic region), the two alleles have relatively similar methylation values (one allele is hypermethylated relative to the other, but the average difference is small -- < 5%). However, as half of the total number of genes from this large region are differentially expressed, we expect to find many differentially methylated regions in regulatory sequences.

For now, dmrseq was run with 25 permutations and 0.025 as the cutoff for methylation differences (other parameters held as the default). I obtained ~650 DMRs under Q < 0.05. Based on my preliminary results, these DMRs are significantly enriched in regulatory regions compared to null regions with similar GC content, so the results seem very promising.

I was worried that dmrseq may not be very applicable to my case, because methylation data are limited to a 100Mb region instead of the whole genome (1Gb). But based on your comment:

It is designed to detect regions of differential methylation out of a large number of possible regions.

it might be OK because this region is large enough?

Thanks again!

@kdkorthauer
Copy link
Owner

kdkorthauer commented Mar 1, 2019

I think 100Mb actually seems plenty long enough. I've applied it to single chromosomes with success. And if it is one contiguous region, you should be OK with the default settings (rather than adapting the settings as I was mentioning for RRBS / sets of many disjoint targeted regions).

Great to hear that you are detecting interesting regions! Thanks for pointing me to the SNPsplit tool, as well.

I'm going to go ahead and close this issue, as it seems like your main questions are answered for now, but please don't hesitate to reach out if anything else arises (either here in this thread or in a new issue).

Best,
Keegan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants