Duration of a permutation longer than previous one #42

Open
tmms1 opened this issue Mar 11, 2021 · 3 comments

@tmms1

tmms1 commented Mar 11, 2021

Dear Keegan

I am analyzing a WGBS dataset with dmrseq. The data is not human or mouse, so I used all the tips you gave here to figure out a sensible analysis (trying different parameter settings on one chromosome, plotting the DMRs, ...). I found a sensible setting and everything was working: dmrseq didn't produce any errors and was running quite fast. However, at a certain permutation, dmrseq produced warnings and took significantly longer to run. I have no idea what is causing this or how I can solve it. Could you offer any advice?

Here are some outputs so you can see the difference in running time.

This is the output from the region discovery step. The data consists of 8 samples, collected at four different locations at two different time points. At the moment, I am testing for a difference in location while adjusting for time point. I encoded both location and time as factors.
image

This is an output of a good permutation:
image

This is an output of a bad permutation:
image

Thanks in advance.

With kind regards
Tim

@kdkorthauer
Owner

Hi Tim,

In the 'bad' permutation, the runtime is so much longer because it takes longer to fit region-specific models in regions with lots of CpGs. The sizes of the regions depend on the data (e.g. if long stretches exist at a particular smoothing bandwidth, the regions are longer). However, I've not seen this particular situation before (long stretches not seen in the data, but seen in only some permutations).

In addition to time point and location, are there any other covariates that might explain variation in methylation? What could be happening in the 'bad' permutation is that there is some other latent factor that has some association with methylation.

What is the distribution of candidate region size (number of CpGs) like? (e.g. if you set permutations to something very low to avoid having a 'bad' permutation just to get some output of candidate regions, you can check the size distribution)
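One way to look at the size distribution is sketched below in base R. The helper `summarise_region_sizes` and the toy values are made up for illustration; the commented lines assume a BSseq object called `bs` (a hypothetical name) and rely on the `L` column of the dmrseq result holding the number of CpGs per region, and on `maxPerms` controlling the number of permutations.

```r
# Summarise candidate region sizes (number of CpGs per region).
# `n_cpgs` is a numeric vector with one entry per candidate region.
summarise_region_sizes <- function(n_cpgs) {
  c(n = length(n_cpgs),
    quantile(n_cpgs, probs = c(0.5, 0.9, 0.99)),
    max = max(n_cpgs))
}

# With dmrseq installed and a BSseq object `bs` (hypothetical name),
# a quick run with few permutations might look like:
#   regions <- dmrseq::dmrseq(bs, testCovariate = "location",
#                             adjustCovariate = "time", maxPerms = 2)
#   summarise_region_sizes(regions$L)

# Made-up sizes, only to show the shape of the output.
toy_sizes <- c(5, 6, 8, 12, 15, 30, 250)
size_summary <- summarise_region_sizes(toy_sizes)
print(size_summary)
```

A long right tail (a maximum far above the 99% quantile) would point to a few very large regions dominating the model-fitting time.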

Best,
Keegan

@tmms1
Author

tmms1 commented Mar 15, 2021

Dear Keegan

Thank you for the quick answer.

I don't think there are other covariates that could explain the difference in methylation. The samples come from a certain plant. They were collected at four different locations and at two different times (four months apart). I would even guess that the time effect is rather small, because methylation is a long-term adaptation. The model I fitted was dmrseq(..., testCovariate = "location", adjustCovariate = "time"). I think this is correct, isn't it? If I understand the method correctly, "time" is included in the design matrix when calculating the region-level statistic and there are no restrictions placed on the permutations? I am not quite sure how permutations work with this design. I understand that you swap the labels of the samples, but I don't see which permutations are possible.

Here is a boxplot of the widths of the regions. I did not refit the model; I just waited until the analysis finished. It is remarkable that this warning occurred only in permutation 7 (of 10), and for all chromosomes.
image

I also plotted the distance between consecutive regions (based on coordinate). I think this plot shows that the smoothing bandwidth is sensible, because you don't observe many regions that are very close to each other.
image
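For readers who want to reproduce these two diagnostics, here is a minimal base-R sketch. The coordinates are made up, and it assumes the regions are available as a data frame with `seqnames`, `start` and `end` columns (which is what `as.data.frame()` gives for a GRanges result).

```r
# Hypothetical regions table; the coordinates are invented for illustration.
regions_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  start    = c(100, 5000, 5600, 200, 9000),
  end      = c(400, 5300, 6000, 800, 9500)
)

# Sort by chromosome, then by start coordinate.
regions_df <- regions_df[order(regions_df$seqnames, regions_df$start), ]

# Region widths in base pairs (inclusive coordinates).
widths <- regions_df$end - regions_df$start + 1

# Gap between each region and the next one on the same chromosome:
# number of bases strictly between the end of one and the start of the next.
gaps <- unlist(lapply(split(regions_df, regions_df$seqnames), function(d) {
  d$start[-1] - d$end[-nrow(d)] - 1
}), use.names = FALSE)

print(widths)
print(gaps)

# Possible plots:
#   boxplot(widths, log = "y", ylab = "Region width (bp)")
#   hist(log10(gaps + 1), xlab = "log10 gap to next region (bp)")
```

Gaps are computed per chromosome so that the last region of one chromosome is never compared with the first region of the next.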

Thanks in advance.

With kind regards
Tim

@kdkorthauer
Owner

Hi Tim,

Thanks for following up.

> If I understand the method correctly, "time" is included into the design matrix while calculating the region-level statistic and there are no restrictions placed on the permutations? I am not quite sure how permutations work with this design. I understand that you swap the labels of the samples, but I don't see which permutations are possible.

Yes, that's correct. The labels are swapped at random, and there's no restriction on the permutations for this specification.
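To make the permutation space concrete, here is a small base-R sketch of an unrestricted label shuffle for this design. It is not dmrseq's internal code: the sample sheet, the level names and the seed are made up, and it assumes two samples per location (4 locations x 2 time points).

```r
# Hypothetical sample sheet: 4 locations x 2 time points = 8 samples.
pheno <- expand.grid(
  location = paste0("loc", 1:4),
  time     = c("t1", "t2"),
  stringsAsFactors = TRUE
)
n <- nrow(pheno)

# Number of distinct ways to assign the location labels to the samples
# (multinomial coefficient: n! divided by the product of group-size factorials).
group_sizes <- as.integer(table(pheno$location))
n_labelings <- factorial(n) / prod(factorial(group_sizes))
print(n_labelings)

# One unrestricted permutation: shuffle the location labels across all
# samples, ignoring the time point each sample belongs to.
set.seed(1)
perm_idx <- sample(n)
permuted <- pheno
permuted$location <- pheno$location[perm_idx]

# Cross-tabulate permuted location against time. Under an unrestricted
# shuffle, a permuted location can end up with both samples from the same
# time point, which never happens in the original balanced design.
print(table(permuted$location, permuted$time))
```

With two samples per location this gives 8! / (2!)^4 = 2520 distinct labelings, so 10 random permutations sample only a small part of that space.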

I must admit I'm puzzled. I agree that your metrics suggest the smoothing bandwidth is sensible, so I don't have a reasonable explanation for why this 'bad' permutation exists.

If you're willing to share a small subset of your data (for example, just one chromosome, or even smaller if it generates the same result), I'd be happy to dig into it further. Let me know, and I'm happy to provide a dropbox link for upload.

Best,
Keegan
