Enhancement of multiBamSummary performance #1138

Open
LeilyR opened this issue May 6, 2022 · 1 comment

Comments

@LeilyR
Contributor

LeilyR commented May 6, 2022

multiBamSummary is slow when the number of BAM files increases. (High sequencing depth could also be a factor.)

@adRn-s
Collaborator

adRn-s commented May 9, 2022

I've just finished reading about Dask, h5py and other alternatives to NumPy.

At first, I thought that one of them could speed up the step of writing data to disk. According to this benchmark, I was wrong.

So if replacing NumPy is worthwhile at all, the gain would probably come from other functions (matrix operations, linear algebra, etc.) that could be parallelized.

Of the options evaluated, Dask seems promising, but adopting it would entail rewriting more modules, because some NumPy functionality currently used in deepTools is outside its scope.

It seems to me that h5py is a drop-in replacement for NumPy: it has every data type except generic objects. Correct me if I'm wrong, but we're not using dtype "O".
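One way to check the dtype "O" assumption at runtime is to inspect the arrays just before they are written out. This is only a minimal sketch: `uses_object_dtype` and the example arrays are hypothetical names for illustration, not existing deepTools code. It relies on NumPy's `dtype.hasobject` attribute, which also covers object fields nested in structured dtypes.

```python
import numpy as np


def uses_object_dtype(array_like):
    """Return True if the data would be stored with dtype "O" (generic objects).

    Hypothetical helper for illustration; not part of deepTools.
    """
    return bool(np.asarray(array_like).dtype.hasobject)


# Example arrays (made up for this sketch, not real deepTools data)
counts = np.zeros((4, 3), dtype=np.float64)      # numeric matrix
mixed = np.array([1, "x", None], dtype=object)   # generic Python objects

print(uses_object_dtype(counts))  # numeric dtype, no objects
print(uses_object_dtype(mixed))   # object dtype
```

Calling such a helper in the tests, or right before the write step, would flag any array that unexpectedly falls back to generic objects.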

So h5py could be the way to go, and I will try it. Even if it speeds things up, it would be safer to have a more thorough test suite first, especially for the modules that rely most on NumPy functions. Even if we don't change them directly, they could be affected. Here's a list of the Python modules with their count of NumPy calls (actually, the number of lines that match the np\\. regex...)

❯ rg -c np\\. *.py  | awk -F ':' '{print $2 "\t" $1}' | sort -rn
86      heatmapper.py
60      plotProfile.py
47      correlation.py
46      plotFingerprint.py
45      plotHeatmap.py
38      getFragmentAndReadSize.py
22      computeMatrixOperations.py
22      computeGCBias.py
21      countReadsPerBin.py
15      SES_scaleFactor.py
15      plotEnrichment.py
14      correctGCBias.py
12      plotCoverage.py
12      heatmapper_utilities.py
11      getScorePerBigWigBin.py
[truncated]
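For anyone without ripgrep, roughly the same count can be produced with the Python standard library. This is a sketch, not an exact reimplementation of the pipeline above: `count_numpy_lines` is a hypothetical name, and it assumes the NumPy alias is `np`, the same assumption the regex already makes.

```python
import re
from pathlib import Path

# Same pattern as the rg call: a literal "np." anywhere on the line
NP_CALL = re.compile(r"np\.")


def count_numpy_lines(directory="."):
    """Return (count, filename) pairs for *.py files, highest count first.

    Hypothetical helper for illustration; not part of deepTools.
    """
    counts = []
    for path in sorted(Path(directory).glob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        matches = sum(1 for line in text.splitlines() if NP_CALL.search(line))
        if matches:  # like `rg -c`, skip files without any match
            counts.append((matches, path.name))
    # Highest count first, comparable to `sort -rn`
    counts.sort(key=lambda pair: pair[0], reverse=True)
    return counts


# Usage, from the directory that holds the modules:
# for matches, name in count_numpy_lines("."):
#     print(f"{matches}\t{name}")
```

Like the original command, this counts matching lines rather than individual calls, so a line with two NumPy calls is counted once.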

If larger data sets are needed for the tests, this can be handled with git-lfs.
