-
I have been working on code with `dataset.groupby("time.year").quantile(np.linspace(0.01, 0.9, 21), dim="time")`. It basically takes forever (over an hour for ONE dataset), and I have to run this line for upwards of 50 datasets whose results I then have to write to disk as well. Is there a way I can speed this up, or maybe rewrite it in a different way such that I get the same results? I have found that the only way it executes a little faster is by using `skipna=False`.
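For context, a minimal self-contained sketch of the pattern being asked about; the variable name, grid size, and number of timesteps are made up for illustration, and only the groupby/quantile call itself is taken from the question:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for one of the ~50 datasets: 93 timesteps on a 2D grid
# (the real variable names and sizes are not given in the question).
time = pd.date_range("2020-01-01", periods=93)
ds = xr.Dataset(
    {"tas": (("time", "y", "x"), np.random.rand(93, 500, 500))},
    coords={"time": time},
)

# The slow line from the question: 21 quantiles per year along the time dimension.
result = ds.groupby("time.year").quantile(np.linspace(0.01, 0.9, 21), dim="time")
```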
-
`skipna=False` will help a bit but probably not enough. Is it a dask array? Calling `.compute()` (or `.load()`?) before the groupby operation might also help. 93 timesteps with groupby and 21 quantiles seem very few?
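As a rough illustration of that advice (reusing the hypothetical `ds` from the sketch under the question; whether `.load()` helps depends on whether the data is actually dask-backed):

```python
import numpy as np

# If the dataset is lazy (dask-backed), pull it into memory once before the
# groupby so the quantile is not recomputing from chunks:
ds_in_memory = ds.load()  # .compute() is equivalent but returns a new object

# skipna=False takes the faster non-nan-aware quantile path; note that any NaN
# present in a group will then propagate into that group's result.
result = ds_in_memory.groupby("time.year").quantile(
    np.linspace(0.01, 0.9, 21), dim="time", skipna=False
)
```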
-
I also faced some difficulties speeding up quantile calculation. In my case, using `skipna=False` was not an option because handling nan values was mandatory. Thus, I created a library (fastnanquantile) for it (it uses numba under the hood). You can install it using pip: `pip install fastnanquantile`
Replace your code with: `dataarray.groupby("time.year").map(xrcompat.xr_apply_nanquantile, q=np.linspace(0.01, 0.9, 21), dim="time")`. Note that you need to use a DataArray instead of a Dataset. Performance gains depend on the data shape. For time composite creation, where you typically have one small dimension (time) and two large dimensions (coordinates), you can expect much faster computation (10x faster or more). Please see this post for a brief benchmark.
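A sketch of how that suggestion could look end to end, again reusing the hypothetical `ds` from the first sketch; the `from fastnanquantile import xrcompat` import path is an assumption based on the call above (check the library's README), and the variable name is hypothetical:

```python
# pip install fastnanquantile
import numpy as np
from fastnanquantile import xrcompat  # assumed import path for the xrcompat module

# The helper works on a DataArray, so pick one variable out of the Dataset
# (variable name is hypothetical):
da = ds["tas"]

# map() applies the nan-aware quantile helper to each yearly group along time.
yearly_quantiles = da.groupby("time.year").map(
    xrcompat.xr_apply_nanquantile,
    q=np.linspace(0.01, 0.9, 21),
    dim="time",
)
```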
-
Cc @maresb |
-
Thanks @lbferreira and @zoj613! This could be useful for my current work. |