
How does Modkit handle Large Genome Data? #190

Open
Yang990-sys opened this issue May 27, 2024 · 5 comments
Labels
question Further information is requested troubleshooting workflow and data preparation questions

Comments

@Yang990-sys

Hello,
I am using modkit to study human methylation. However, a bedMethyl file containing three modification types averages around 300 GB, which is too large for my pipeline to analyze. Most rows in the file have a methylation fraction of 0, which is inconvenient for downstream analysis. Is it safe to delete all rows with a methylation fraction of 0? And when calculating DMR, is the methylation fraction for unmeasured positions assumed to be 0 by default?
I mainly use two subcommands, dmr pair and find-motifs; would deleting the all-zero rows affect them?

@Yang990-sys
Author

Yang990-sys commented May 27, 2024

Also, for large genomes the memory usage is alarming: the peak memory usage of modkit pileup reached 150 GB, and modkit find-motifs exhausted a server with 500 GB of RAM. Are there plans to optimize memory usage in a later release?

@Yang990-sys Yang990-sys changed the title How does dmr pair handle missing values? How does Modkit handle Large Genome Data? May 28, 2024
@Yang990-sys Yang990-sys reopened this May 28, 2024
@Yang990-sys
Author

I have read the Performance Considerations documentation, but it does not address my problem. With a 300 GB input file, memory is exhausted before the seed-searching step even begins.


@ArtRand
Contributor

ArtRand commented May 28, 2024

Hello @Yang990-sys,

May I ask if deleting all 0 rows will have an impact on it?

For modkit dmr pair, removing bedMethyl records with 0% modification will not yield correct results. If you do, positions where both conditions have 0% methylation will not be processed at all, and you will get no output for those bases. And where the two conditions differ (say one condition has 100% modification and the other has 0%), the DMR algorithm will not assume that a missing record implicitly means the position is canonical; it will see that there is no data to compare against and emit no output. Do the majority of the records have very low $N_{\text{valid}}$? If so, you could remove low-coverage records by filtering the data through a pipe before writing it to the filesystem:

modkit pileup ${modbam} - | awk '$5>5' | bgzip > ${out_filt_bedmethyl}
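As a quick illustration of what that filter does (using made-up records, not real output): column 5 of a modkit bedMethyl record is the BED score, which modkit sets to $N_{\text{valid}}$ capped at 1000, so `$5>5` drops records with 5 or fewer valid calls.

```shell
# Three made-up bedMethyl records; the first has score 3 and is dropped,
# the other two (scores 12 and 25) pass the `$5>5` coverage filter.
cat <<'EOF' | awk '$5>5'
chr1 10468 10469 m 3 + 10468 10469 255,0,0 3 0.00 0 3 0 0 0 0 0
chr1 10470 10471 m 12 + 10470 10471 255,0,0 12 66.67 8 4 0 0 0 0 0
chr1 10483 10484 m 25 + 10483 10484 255,0,0 25 4.00 1 24 0 0 0 0 0
EOF
```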

I think a better option is to partition the analysis into genomic regions, for example chromosomes or Mbp-long intervals. Differential methylation works on a genomic "column", so you can process each chromosome (or an interval of a chromosome) separately and then combine the results. You can also pipe the output of modkit pileup directly into bgzip to save space when writing down the table.
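A minimal sketch of that per-chromosome partitioning, assuming modkit pileup's `--region` option; the input name `sample.modbam` and the chromosome list are placeholders. `DRY_RUN=1` only prints the commands so the loop can be inspected before actually running it:

```shell
# Run pileup per chromosome so no single bedMethyl covers the whole genome.
# Check `modkit pileup --help` for the exact region syntax on your version.
DRY_RUN=1
for chrom in chr1 chr2 chrX; do
  cmd="modkit pileup --region ${chrom} sample.modbam - | bgzip > sample.${chrom}.bed.gz"
  if [ "${DRY_RUN}" = "1" ]; then
    echo "${cmd}"   # print the shard command instead of executing it
  else
    eval "${cmd}"
  fi
done
```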

For modkit find-motifs the answer is a little trickier: currently the algorithm needs to load the entire bedMethyl table into memory. I'll need to run some experiments to see whether and how I can remove this requirement when working with very large bedMethyl files. A couple of things you could try in the meantime:

  • Make --context-size smaller; the default is (12, 12), so try (8, 8) for example.
  • Make sure --min-coverage is sufficiently high (this applies to DMR as well, as mentioned above).
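The two suggestions above might be combined into a single invocation along these lines; the flag names for the inputs (`--in-bedmethyl`, `--ref`, `-o`) are assumptions to verify against `modkit find-motifs --help`, and the command is only echoed here rather than executed:

```shell
# Sketch only: smaller context window and a higher coverage floor to shrink
# the in-memory bedMethyl table. File names are placeholders.
echo modkit find-motifs \
  --in-bedmethyl sample.bed.gz \
  --ref genome.fa \
  --context-size 8 8 \
  --min-coverage 10 \
  -o motifs.tsv
```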

And for large genomes, memory usage is also very scary, the peak memory usage of modkit pileup reached 150G, and modkit find-motifs has exploded a server with 500GB of memory. Is there any optimization for this aspect in the later stage?

How large is the genome you're working with? (You previously mentioned studying human methylation.) I am working on decreasing the memory usage (and increasing the processing speed) of pileup; however, as I mentioned, decreasing memory usage for find-motifs requires some experiments on my side.

@ArtRand ArtRand added question Further information is requested troubleshooting workflow and data preparation questions labels May 28, 2024