
How does Modkit handle Large Genome Data? #190

Open
Yang990-sys opened this issue May 27, 2024 · 5 comments
Labels
question Further information is requested troubleshooting workflow and data preparation questions

Comments

@Yang990-sys

Hello,
I am using modkit to study human methylation. However, a bedMethyl file containing three modification types averages around 300 GB, which is too large for my pipeline to analyze. Most rows in the file have a methylation fraction of 0, which is inconvenient for downstream analysis. Is it safe to delete all rows with a methylation fraction of 0? And when calculating DMR, is the methylation fraction for unmeasured positions assumed to be 0 by default?
I mainly use two subcommands, dmr pair and find-motifs; would deleting the all-zero rows affect them?

@Yang990-sys
Author

Yang990-sys commented May 27, 2024

Also, for large genomes the memory usage is alarming: the peak memory usage of modkit pileup reached 150 GB, and modkit find-motifs exhausted a server with 500 GB of RAM. Are there plans to optimize memory usage in a later release?

@Yang990-sys Yang990-sys changed the title How does dmr pair handle missing values? How does Modkit handle Large Genome Data? May 28, 2024
@Yang990-sys Yang990-sys reopened this May 28, 2024
@Yang990-sys
Author

I have read the Performance Considerations documentation, but it does not address my problem. With a 300 GB input file, memory is exhausted before the seed-searching step even begins.


@ArtRand
Contributor

ArtRand commented May 28, 2024

Hello @Yang990-sys,

May I ask if deleting all 0 rows will have an impact on it?

For modkit dmr pair, removing bedMethyl records with 0% modification will not yield correct results. If you do, positions where both conditions have 0% methylation will not be processed at all, and you will get no output for those bases. And where the two conditions differ (say one condition has 100% modification and the other has 0%), the DMR algorithm will not assume that a missing record implicitly means the position is canonical; it will see that there is no data to compare against and emit no output. Do the majority of the records have very low $N_{\text{valid}}$? If so, you could remove low-coverage records by filtering the data through a pipe before writing it to the filesystem:

modkit pileup ${modbam} - | awk '$5>5' | bgzip > ${out_filt_bedmethyl}
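As a quick illustration of what that filter does (using made-up records, not real output): column 5 of a modkit bedMethyl record is the BED score, which modkit sets to $N_{\text{valid}}$ capped at 1000, so `$5>5` drops records with 5 or fewer valid calls.

```shell
# Three made-up bedMethyl records; the first has score 3 and is dropped,
# the other two (scores 12 and 25) pass the `$5>5` coverage filter.
cat <<'EOF' | awk '$5>5'
chr1 10468 10469 m 3 + 10468 10469 255,0,0 3 0.00 0 3 0 0 0 0 0
chr1 10470 10471 m 12 + 10470 10471 255,0,0 12 66.67 8 4 0 0 0 0 0
chr1 10483 10484 m 25 + 10483 10484 255,0,0 25 4.00 1 24 0 0 0 0 0
EOF
```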

I think a better option is to partition the analysis into genomic regions, for example chromosomes or Mbp-long intervals. Differential methylation works on a genomic "column", so you can process each chromosome (or an interval of a chromosome) separately and then combine the results. You can also pipe the output of modkit pileup directly into bgzip to save space when writing down the table.
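A minimal sketch of that per-chromosome partitioning, assuming modkit pileup's `--region` option; the input name `sample.modbam` and the chromosome list are placeholders. `DRY_RUN=1` only prints the commands so the loop can be inspected before actually running it:

```shell
# Run pileup per chromosome so no single bedMethyl covers the whole genome.
# Check `modkit pileup --help` for the exact region syntax on your version.
DRY_RUN=1
for chrom in chr1 chr2 chrX; do
  cmd="modkit pileup --region ${chrom} sample.modbam - | bgzip > sample.${chrom}.bed.gz"
  if [ "${DRY_RUN}" = "1" ]; then
    echo "${cmd}"   # print the shard command instead of executing it
  else
    eval "${cmd}"
  fi
done
```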

For modkit find-motifs the answer is a little trickier: currently the algorithm needs to load the entire bedMethyl table into memory. I'll need to run some experiments to see whether and how I can remove this requirement when working with very large bedMethyl files. A couple of things you could try in the meantime:

  • Make --context-size smaller; the default is (12, 12), so try (8, 8) for example.
  • Make sure --min-coverage is sufficiently high (this applies to DMR as well, as mentioned above).
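The two suggestions above might be combined into a single invocation along these lines; the flag names for the inputs (`--in-bedmethyl`, `--ref`, `-o`) are assumptions to verify against `modkit find-motifs --help`, and the command is only echoed here rather than executed:

```shell
# Sketch only: smaller context window and a higher coverage floor to shrink
# the in-memory bedMethyl table. File names are placeholders.
echo modkit find-motifs \
  --in-bedmethyl sample.bed.gz \
  --ref genome.fa \
  --context-size 8 8 \
  --min-coverage 10 \
  -o motifs.tsv
```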

And for large genomes, memory usage is also very scary, the peak memory usage of modkit pileup reached 150G, and modkit find-motifs has exploded a server with 500GB of memory. Is there any optimization for this aspect in the later stage?

How large is the genome you're working with? (You previously mentioned studying human methylation.) I am working on decreasing the memory usage (and increasing the processing speed) of pileup; however, as I mentioned, decreasing memory usage for find-motifs requires some experiments on my side.

@ArtRand ArtRand added question Further information is requested troubleshooting workflow and data preparation questions labels May 28, 2024