Add some warnings about rechunking to the docs #6569

Merged 6 commits on May 10, 2022
Changes from 1 commit
Back to one liners
fmaussion committed May 9, 2022
commit 5fca7b0149753b801ee2e5668e0aa791e7cd67c8
19 changes: 7 additions & 12 deletions doc/user-guide/dask.rst
@@ -555,25 +555,20 @@ larger chunksizes.
Optimization Tips
-----------------

With analysis pipelines involving both spatial subsetting and temporal resampling, Dask performance can become very slow or memory hungry in certain cases. Here are some optimization tips we have found through experience:

1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet (see `Dask issue #746 <https://github.com/dask/dask/issues/746>`_). More generally, ``groupby()`` is a costly operation and does not (yet) perform well on datasets split across multiple files (see :pull:`5734` and linked discussions there). A sketch of this tip follows the list.

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a failure case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory (see `Dask issue #874 <https://github.com/dask/dask/issues/874>`_). A sketch of this pattern follows the list.

3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load subsets of data which span multiple chunks. On individual files, prefer to subset before chunking (suggestion 1).

4. Chunk as early as possible, and avoid rechunking as much as possible. Always pass the ``chunks={}`` argument to :py:func:`~xarray.open_mfdataset` to avoid redundant file reads.

5. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset` can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package. A combined sketch of tips 3-5 follows the list.

6. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.

7. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be useful in identifying performance bottlenecks (see the sketch after this list).
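
The following is a minimal sketch of tip 1. The file pattern, variable, and coordinate names (``lat``, ``lon``, ``time``) are hypothetical placeholders; adapt them to your dataset.

.. code-block:: python

    import xarray as xr

    # Open the files lazily with dask (hypothetical file pattern).
    ds = xr.open_mfdataset("/path/to/files/*.nc", chunks={})

    # Subset in space and time *before* resampling or grouping
    # (the slice bounds assume ascending coordinates) ...
    subset = ds.sel(lat=slice(20, 50), lon=slice(-130, -60)).isel(time=slice(0, 365))

    # ... so that the resample only has to touch the blocks you kept.
    monthly = subset.resample(time="1MS").mean()
    monthly.compute()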
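
A sketch of tip 2, assuming hypothetical input and output paths: the temporal mean is written to disk first and read back, instead of being carried through one large task graph.

.. code-block:: python

    import xarray as xr

    ds = xr.open_mfdataset("/path/to/files/*.nc", chunks={})  # hypothetical path

    # Compute the temporal mean and persist it to disk first ...
    ds.mean("time").to_netcdf("temporal_mean.nc")

    # ... then reload it and subtract it from the original data.
    mean = xr.open_dataset("temporal_mean.nc")
    anomalies = ds - mean
    anomalies.to_netcdf("anomalies.nc")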
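
Tips 3-5 all come down to how :py:meth:`~xarray.open_mfdataset` is called. A sketch, assuming files with ``latitude`` and ``longitude`` coordinates:

.. code-block:: python

    import xarray as xr

    # Tip 4: always pass ``chunks`` so the files are opened lazily with dask
    # instead of being read eagerly (and possibly more than once).
    ds = xr.open_mfdataset("/path/to/files/*.nc", chunks={})

    # Tips 3 and 5: smaller chunks across space, and the h5netcdf engine,
    # which can be quicker than the default netCDF4-based engine.
    ds = xr.open_mfdataset(
        "/path/to/files/*.nc",
        chunks={"latitude": 10, "longitude": 10},
        engine="h5netcdf",
    )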
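
A sketch of tip 7 using the single-machine diagnostics from ``dask.diagnostics`` (these work with the default local schedulers; the distributed scheduler has its own dashboard). The dataset path is hypothetical.

.. code-block:: python

    import xarray as xr
    from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler

    ds = xr.open_mfdataset("/path/to/files/*.nc", chunks={})

    # A progress bar already shows where the time is going.
    with ProgressBar():
        climatology = ds.groupby("time.month").mean("time").compute()

    # Task and resource profilers record task timings and CPU/memory usage,
    # which can later be plotted with dask.diagnostics.visualize (needs bokeh).
    with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
        climatology = ds.groupby("time.month").mean("time").compute()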