Performance considerations

Sharding a large modBAM by region.

The --region option in pileup, summary, and sample-probs can be used to operate on a subset of records in a large BAM. In a distributed environment, the genome can be sharded into large sections, each passed via --region to a separate, concurrent modkit process, with the outputs merged afterward in a "map-reduce" pattern.
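
For example, a minimal sketch of this pattern in shell, assuming a sorted and indexed input.bam and per-chromosome shards (the file names are illustrative):

```bash
# "Map" step: run one modkit process per chromosome, in the background.
# input.bam must be sorted and indexed; output names are illustrative.
for region in chr1 chr2 chr3; do
    modkit pileup input.bam "pileup.${region}.bed" --region "${region}" &
done
wait  # block until every shard has finished

# "Reduce" step: the shards cover disjoint regions, so simple
# concatenation produces the merged output.
cat pileup.chr1.bed pileup.chr2.bed pileup.chr3.bed > pileup.bed
```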

Setting the --interval-size and --chunk-size (pileup).

Whenever operating on a sorted, indexed BAM, modkit will operate in parallel on disjoint spans of the genome. The length of these spans (i.e. intervals) can be set with --interval-size, or with --sampling-interval-size for the sampling algorithm only. The defaults for these parameters work well for genomes such as the human genome. For smaller genomes with high coverage, you may decide to decrease the interval size in order to take advantage of parallelism.

The pileup subcommand also has a --chunk-size option that limits the total number of intervals processed in parallel. By default, modkit sets this parameter to be 50% larger than the number of threads. In general, this is a good setting for balancing parallelism and memory usage. Increasing --chunk-size can increase parallelism (and decrease run time) but will consume more memory.
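
As a hedged sketch, tuning both options for a small, high-coverage genome (the numbers are illustrative assumptions, not recommendations):

```bash
# Shrink the intervals so a small genome still splits into enough spans
# to keep all threads busy, and allow more intervals in flight at once.
# A larger --chunk-size trades memory for parallelism.
modkit pileup input.bam pileup.bed \
    --threads 16 \
    --interval-size 50000 \
    --chunk-size 32
```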

Memory usage in modkit extract.

Transforming reads into a table with modkit extract can produce large files (especially with long reads). Before the data can be written to disk, however, it is enqueued in memory, which can create a large memory burden. There are a few ways to decrease the amount of memory modkit extract will use in these cases (a sketch combining them follows the list):

  1. Lower --queue-size; this decreases the number of batches that will be held in flight.
  2. Use --ignore-index; this forces modkit extract to run a serial scan of the modBAM.
  3. Decrease --interval-size; this decreases the size of each batch.
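
For instance, combining the first and third options (the values are illustrative, and the positional arguments may differ between modkit versions; check modkit extract --help):

```bash
# Fewer batches in flight and smaller batches both cap peak memory,
# at some cost in throughput. The values below are illustrative only.
modkit extract input.bam reads.tsv \
    --queue-size 100 \
    --interval-size 50000
```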

Parallelism in the search algorithm.

The search algorithm takes advantage of parallelism at nearly every step and therefore benefits hugely from running with as many threads as possible (specified with --threads). This horizontal scalability is most easily seen in the secondary search step, where (by default) 129536 individual "seed sequences" are evaluated for potential refinement. If you find that this search is taking a very long time (indicated by the progress bar message "<mod_code> seeds searched"), you may consider one of the following:

  • Increase the --exhaustive-seed-min-log-odds parameter; this decreases the number of seeds passed on to the refinement step (which is more computationally expensive).
  • Decrease --exhaustive-seed-len to 2, or decrease --context-size; because the number of candidate seeds grows exponentially with these parameters, either change exponentially decreases the number of seeds to be searched.

You may also decide to run with --skip-search first and inspect the results before committing to a full search.
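
As a hedged sketch of tuning both knobs at once (this assumes the search runs under the modkit find-motifs subcommand, as in recent releases; the input flags and threshold values here are illustrative assumptions, not recommendations):

```bash
# Raise the seed log-odds cutoff so fewer seeds reach the expensive
# refinement step, and shorten the seeds to shrink the search space.
# Inputs and values are assumptions; consult --help for your version.
modkit find-motifs \
    --in-bedmethyl pileup.bed \
    --ref reference.fasta \
    --threads 32 \
    --exhaustive-seed-min-log-odds 2.5 \
    --exhaustive-seed-len 2
```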