
Add some warnings about rechunking to the docs #6569

Merged (6 commits, May 10, 2022)

Conversation

fmaussion (Member)

This adds some warnings in the right places about rechunking a dataset opened with `open_mfdataset` (see pangeo-data/rechunker#100 (comment) for context).

Thanks to @dcherian for the wisdom of the day!

@max-sixty (Collaborator)

Thanks @fmaussion!

@max-sixty added the "plan to merge (Final call for comments)" label on May 3, 2022.

1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_.) A sketch combining all three tips follows this list.

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, when subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_.)

3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks.
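
Taken together, these tips describe one pipeline. Below is a minimal sketch, assuming hypothetical ERA5-style files with descending latitudes; the file pattern, dimension names, and output file names are all made up for illustration:

```python
import xarray as xr

# Tip 3: ask for small spatial chunks at open time
# (the file pattern and dimension names here are hypothetical)
ds = xr.open_mfdataset("era5_*.nc", chunks={"latitude": 10, "longitude": 10})

# Tip 1: index early, before resample()/groupby(), so the grouped
# computation only ever touches the chunks you actually need
region = ds.sel(latitude=slice(60, 40), longitude=slice(0, 20))
monthly = region.resample(time="1MS").mean()

# Tip 2: persist intermediate results to disk and reload them, rather
# than keeping one long task graph alive; e.g. save the temporal mean
# before subtracting it
clim = region.mean("time")
clim.to_netcdf("clim.nc")
clim = xr.open_dataset("clim.nc")
anomalies = region - clim  # subtract against the on-disk mean
```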
Contributor

"chunks of data referring to different chunks" is kinda confusing, how about "subsets of data which span multiple chunks"? Or is that not the intended meaning?

fmaussion (Member, Author)

Yeah, I'm also not sure what was meant here (it's already like this in the docs).

"subsets of data which span multiple chunks" sounds much better.

I have a related question though. When one opens a single netCDF file, is it better to:

  1. subset first (data still not loaded), then chunk (convert the lazily loaded arrays from xarray to dask arrays)?
  2. chunk first (at the call to `ds.chunk`), then subset?

Contributor

Subset first to reduce memory; otherwise the whole chunk will get loaded into memory, and then the unwanted values will be discarded.
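
A minimal sketch of the two orderings (the file name, dimension names, and slice sizes are hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("data.nc")  # lazy; nothing loaded yet

# Preferred: subset while the data is still lazily loaded,
# then chunk only what remains
small = ds.isel(latitude=slice(0, 100)).chunk({"time": 100})

# Wasteful: chunking first means whole chunks are read into
# memory before the unwanted values are discarded
wasteful = ds.chunk({"time": 100}).isel(latitude=slice(0, 100))
```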

@fmaussion (Member, Author)

I've edited a few more sentences; to me this is ready to merge!

I've been struggling with groupby for a good deal of my week, and I added a warning regarding groupby as well. Feel free to disagree, but I've been unable to get it to work on large datasets across multiple files yet (see the pangeo discourse post).

@max-sixty merged commit 218e77a into pydata:main on May 10, 2022.
@max-sixty (Collaborator)

Thanks as ever, @fmaussion!

dcherian added a commit to dcherian/xarray that referenced this pull request on May 20, 2022.
dcherian added a commit to headtr1ck/xarray that referenced this pull request on May 20, 2022.
Labels: plan to merge (Final call for comments)
4 participants