
More control over lazy data creation (chunking) #3333

Closed · TomekTrzeciak opened this issue Jun 18, 2019 · 16 comments
Labels: Dragon 🐉 (https://github.com/orgs/SciTools/projects/19?pane=info), Feature: ESMValTool, Status: Decision Required, Type: Performance

Comments


Currently, it is not possible to control how lazy data gets chunked, nor to change the chunking afterwards (dask's rechunk function does not change the original chunking; it only adds extra split/merge operations on top of it). While the default choice of chunking might be OK in some cases, in others it may be unsuitable, so it would be useful to allow for user choice in this respect.
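
A minimal illustration of the rechunk point, in plain dask (the array shapes are arbitrary):

```python
import dask.array as da

a = da.ones((1000, 1000), chunks=(100, 1000))
b = a.rechunk((1000, 1000))

# rechunk() does not change how the data is first produced/read: it layers
# extra split/merge tasks on top of the original (100, 1000)-chunked graph.
print(len(dict(a.__dask_graph__())))  # 10 tasks - one per original chunk
print(len(dict(b.__dask_graph__())))  # more tasks: the original 10 plus the merge
```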

pp-mo commented Jun 20, 2019

Hi @TomekTrzeciak, thanks for the suggestion.

In what operations do you want to control chunking?

You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.
We also have chunking control when saving to netCDF: iris.fileformats.netcdf.save supports a 'chunksize' keyword, so that can also appear as a kwarg in an iris.save call.
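
For reference, the constructor route looks like this (a minimal sketch; the shape and names are illustrative):

```python
import dask.array as da
from iris.cube import Cube

# Build a cube directly from a pre-chunked lazy array.
lazy = da.zeros((240, 1500, 2000), chunks=(1, 1500, 2000))
cube = Cube(lazy, var_name="example")
print(cube.has_lazy_data())     # True
print(cube.lazy_data().chunks)  # ((1, 1, ..., 1), (1500,), (2000,))
```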

TomekTrzeciak commented:

> In what operations do you want to control chunking?

iris.load & co.

> You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.

Constructing the cube directly is not very convenient. I guess reassigning cube.data could be an option, but it is still rather awkward to write something like this:

```python
cube = iris.load_cube(filepath)
cube.data = dask.array.from_array(netCDF4.Dataset(filepath)[cube.var_name], chunks=-1)
```

instead of just:

```python
cube = iris.load_cube(filepath, chunks=-1)
```

pp-mo commented Jun 21, 2019

Regarding load, options will depend on the source format.

The "field-based" file formats (FF, PP, GRIB) deal only in 2D fields, and they don't have any efficient access to subregions of a 2d field (i.e. the format code can only load a whole field then extract from it). Thus, the natural chunksize is the whole field, and I don't think there will ever be any practical use for chunking differently in those cases.

But I guess you are talking about netCDF?
What we currently have there is basically determined by this code and this code.
In summary:

  • take the whole variable as one chunk,
  • or use the file's own chunking, if set,
  • and reduce the earlier dimensions if the resulting chunks are impracticably large.

There is certainly scope for controlling that: for instance, the chunk reduction assumes C-order contiguity, so it will behave worst-case if earlier dimensions vary faster in the file.
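
For illustration, here is a rough sketch of the kind of reduction policy described above -- not Iris's actual implementation; the byte limit and halving strategy are invented for the example:

```python
import numpy as np

def reduce_chunks(shape, dtype, limit_bytes=100 * 1024 ** 2):
    # Start from the whole variable as one chunk, then shrink the earliest
    # dimensions first (assuming C-order: earlier dimensions vary slowest)
    # until a single chunk fits within the byte limit.
    chunks = list(shape)
    itemsize = np.dtype(dtype).itemsize
    for dim in range(len(chunks)):
        while np.prod(chunks) * itemsize > limit_bytes and chunks[dim] > 1:
            chunks[dim] = -(-chunks[dim] // 2)  # halve, rounding up
        if np.prod(chunks) * itemsize <= limit_bytes:
            break
    return tuple(chunks)

# e.g. reduce_chunks((20000, 74, 1500, 2000), "f8") shrinks the leading
# dimensions first, leaving the trailing (fastest-varying) dimensions whole.
```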

So I think we are talking about adding a chunk-control keyword to the netcdf loader.
That could be pretty simple, as the underlying interfaces already have suitable controls in them.
Is that what you mean?

TomekTrzeciak commented:

> So I think we are talking about adding a chunk-control keyword to the netcdf loader.
> That could be pretty simple, as the underlying interfaces already have suitable controls in them.
> Is that what you mean?

Yes, I think an extra keyword passed through from the load API to the netCDF loader would be all that's needed.

pp-mo commented Jun 21, 2019

> an extra keyword passed through from the load API to the netCDF loader

I had a quick look. Unfortunately, there is no support for additional args/kwargs in the generic load functions, iris.load/load_cube/load_cubes/load_raw.
I guess that's because it's hard to associate controls with individual load items, unlike save where it makes sense to provide additional controls on a per-cube basis.
So I'm afraid this may fall short of your expectations: I think it will be much harder to extend the iris.load API.

> adding a chunk-control keyword to the netcdf loader

In the netcdf-specific loader, we currently have iris.fileformats.netcdf.load_cubes(filenames, callback=None).
I'd expect to add a new facility there.
How about something like:

```python
iris.fileformats.netcdf.load_cubes(filenames, callback=None, filepath_varname_chunks={})
```
Then use it like:

```python
from iris.fileformats.netcdf import load_cubes as nc_load_cubes

cubes = nc_load_cubes(filename, filepath_varname_chunks={(filename, 'var1'): (32, -1, 16)})
```

Will this work for your purposes?
And generally, what does your use case look like -- we'll be needing a test case anyway ...
Feel free to propose something.

TomekTrzeciak commented:

@pp-mo, exposing chunks in iris.fileformats.netcdf.load_cubes could be one way, but this API is quite low-level (it returns a cube generator, so one needs to be careful how it gets passed around in user code; it is probably best to convert it to something like a CubeList immediately).

I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does - the backends that don't support certain kwargs could simply ignore them. But I can see how this leads to passing arguments through several levels of APIs, so arguably not the nicest pattern.

An alternative could be to use a context manager to set/pass backend options without bloating the top-level APIs. I've noticed that there is already an iris.config.netcdf context manager, so that could be a possible place to add this (or another one could be added specifically for dask if that's preferable). Then the user code could look like this:

```python
with iris.config.netcdf.context(chunks=-1):
    cube = iris.load_cube('dataset.nc')
```
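
For what it's worth, a minimal sketch of how such an option-holding context manager could work (hypothetical: this is not the existing iris.config.netcdf implementation, and the chunks option is invented for illustration):

```python
import threading
from contextlib import contextmanager

class NetcdfOptions(threading.local):
    """Hypothetical per-thread loader options, modelled loosely on
    iris.config.netcdf; the 'chunks' option does not exist in Iris."""

    def __init__(self):
        self.chunks = None  # None => keep the default chunking policy

    @contextmanager
    def context(self, **kwargs):
        # Temporarily override options, restoring the old values on exit.
        saved = self.__dict__.copy()
        self.__dict__.update(kwargs)
        try:
            yield
        finally:
            self.__dict__.clear()
            self.__dict__.update(saved)

netcdf = NetcdfOptions()

# The loader would consult netcdf.chunks inside this block.
with netcdf.context(chunks=-1):
    pass
```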

pp-mo commented Jun 25, 2019

Hi @TomekTrzeciak, thanks for sticking with this.

> I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does

I don't think there is any serious reason to oppose additional load controls. I just thought it sounded like more trouble to get such a change agreed.
I myself certainly don't object in principle to passing keywords down from the main 'load' functions. I think that is probably the most obvious way to do it.

My concern is that, to be useful, I think we need to be able to specify chunking of individual file variables (see why below...). This means that the controls can't be expressed in terms of core Iris concepts such as cube identity, which then looks rather different to the 'save' case.
As I said...

> it's hard to associate controls with individual load items, unlike save where it makes sense

I don't see any sensible way around this, as you can't easily predict what a given load will produce, or which Iris objects relate to which parts of a source file -- because Iris itself doesn't make any simple guarantees about those behaviours: if data changes, you can't reliably know beforehand how it will merge, how many cubes are returned, or in what order -- see for example #3314.

The reason I think we need a flexible control solution is that we do need to cope with large AuxCoords -- often larger than the data variables. That is exactly why we implemented lazy loading for AuxCoords. So it means we will want to control chunking of those variables too. In the near future, we also expect to be dealing with large unstructured grids, which will present the same problem.

I think it could be fine, if we can design a default behaviour that keeps the simple cases simple. I'm just a bit wary, as it isn't immediately obvious to me how that can work.

pp-mo commented Jul 16, 2019

Hints of progress? See #3357 (comment).

pp-mo commented Jul 19, 2019

Cross-copied from #3357.
I'm now trying to retrospectively separate these two discussions!

@TomekTrzeciak Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size.

@pp-mo A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable: in this case we need dimension names, which don't actually appear anywhere in the Iris/CF data representation.
If we could use names of dimension coordinates, that would help loads with this. But I think we can only do that if we first "pre-load", then "re-load" the file. 😢
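
For concreteness, the xarray pattern being referred to (this is real xarray API; the filename and dimension name are illustrative):

```python
import xarray as xr

# Chunk sizes are keyed by *dimension name*, which is a netCDF file-level
# concept rather than an Iris/CF one.
ds = xr.open_dataset("dataset.nc", chunks={"time": 10})
```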

Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case: in principle you could even have an X(r:74, t:20000, y:1500, x:2000) and a Y(x:2000, t:20000, r:74), but I think it would be really rare -- maybe a use case in ocean data?

pp-mo commented Jul 19, 2019

An update in understanding (mine, anyway) ...

> occurrence and order of dimensions will differ between variables

This is true, but I think quite rare, as stated.
However, where I said above ...

> the chunk reduction assumes C-order contiguity, so it will behave worst-case if earlier dimensions vary faster in the file.

Though that may be true for the abstract 'as_lazy_data' call, I now think that is probably not so for netcdf data.
In at least the overwhelming majority of files (all the files I've seen), you can assume C-ordering of dimension variables, and that really ought to inform chunking policy. However, the dask auto-chunking policy is agnostic about dimension order, which at present seems to me a good reason for not relying on it entirely (i.e. for retaining our own policy).

Frustratingly, I can't find a clear statement of this anywhere.
I think it is not guaranteed for HDF5 data generally, as that has a very general access model.
It might well be "a given" for data created via the netCDF interface, but I can't find any clear statement of that.

rockdoc commented Jul 30, 2019

Hi @pp-mo. One possible alternative solution -- or a partial solution -- would be to support user specification of a chunking hint. There would be a default setting, of course: the canonical one (whatever that might be). I adopted this approach in a Python utility I developed for writing multiple compressed variables to netCDF files with different chunking strategies.

Looking at the code, I can see that my utility supported the following chunking hints:

```python
# Chunk size adjustment hints
HINT_INTEGER_MULTIPLE_OF_ONE_CHUNK = 1
HINT_DECREMENT_BY_ONE_FROM_LEFT    = 2
HINT_INCREMENT_BY_ONE_FROM_RIGHT   = 3
HINT_SET_TO_ONE_FROM_LEFT          = 4
HINT_SET_TO_MAX_FROM_RIGHT         = 5
```

Without digging around in the low-level code, I can't remember off the top of my head what each of these hints led to in terms of chunking policy. But that doesn't matter here; you'd obviously choose a variety of hints suitable for chunking dask arrays in different ways.
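
Purely as a guess at what two of these hints could mean in practice (the thread does not record their real semantics, so the following is speculative; the HINT_ constants come from the listing above):

```python
from math import prod

def apply_hint(chunks, hint, limit_elems):
    # Speculative: adjust a chunk shape according to a named hint until it
    # fits within a target number of elements.
    chunks = list(chunks)
    if hint == HINT_SET_TO_ONE_FROM_LEFT:
        # Collapse leading dimensions to 1 until the chunk fits the limit.
        for i in range(len(chunks)):
            if prod(chunks) <= limit_elems:
                break
            chunks[i] = 1
    elif hint == HINT_DECREMENT_BY_ONE_FROM_LEFT:
        # Shave the leftmost shrinkable dimension one step at a time.
        i = 0
        while prod(chunks) > limit_elems and any(c > 1 for c in chunks):
            if chunks[i] > 1:
                chunks[i] -= 1
            else:
                i += 1
    return tuple(chunks)
```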

Your solution might want to fall back on a default chunking hint/policy for those cases when the user (or calling program) doesn't specify chunk sizes explicitly. It sounds like Iris is already implementing a default policy, even if it is just the dask default.

Anyways, just thought I'd throw this into the mix, although it might not be suitable for the current use-case.

(PS: If you did want to snoop around the code, I can sort that out - it's in a private MetO repo.)

pp-mo commented Aug 21, 2019

Interested in this too, @cpelley?

pp-mo commented Feb 7, 2022

Following some more recent experiences, I'm changing my mind on this.
Although it can't be done "in Iris terms", I'm now convinced by a few examples I've seen that only user-controlled chunking can usefully fix some of the performance limitations.

My key motivating example:

  • a loaded cube merges multiple input files: each file contains a 3D datacube, dimensions (nz, ny, nx), for a single timestep; the result is a time sequence --> dimensions (nt, nz, ny, nx)
  • the user extracted a single vertical level --> cube[:, i_lev] --> (nt, ny, nx)
  • when the result is fetched, this attempts to fetch the whole data and only then extract the single vertical level, which exceeds machine memory
  • when saved to a result file, this takes about 20× longer than equivalent data of the same size

So I now believe we really do need to enable user chunking control in such cases, but for the reasons above it will need to be expressed in file-specific terms. Also for the reasons above, I think it makes sense, and will be needed, to apply chunking controls (see the sketch after this list):

  • (a) to specific file variables, and
  • (b) to set chunk sizes for specific file dimensions
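
A hypothetical sketch of what such a control could look like (all names here are invented for illustration; this is not an existing Iris API):

```python
from contextlib import contextmanager

# Invented for illustration: a registry of per-file-variable,
# per-file-dimension chunk sizes that a loader could consult.
_CHUNK_SETTINGS = {}

@contextmanager
def chunk_control(var_name, **dim_chunks):
    # e.g. chunk_control("air_temperature", time=1, model_level_number=1)
    _CHUNK_SETTINGS[var_name] = dict(dim_chunks)
    try:
        yield
    finally:
        _CHUNK_SETTINGS.pop(var_name, None)

def chunks_for(var_name, dim_names, shape):
    # Explicit settings win; any unmentioned dimension keeps its full length.
    settings = _CHUNK_SETTINGS.get(var_name, {})
    return tuple(settings.get(d, n) for d, n in zip(dim_names, shape))
```

With something like this, a loader could chunk the (nt, nz, ny, nx) data above one vertical level at a time, so that cube[:, i_lev] never has to fetch the whole array.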

pp-mo commented Feb 7, 2022

> key motivating example:

I believe that #4448 is also a very similar problem, with possibly a similar solution.

pp-mo commented Feb 9, 2022

Hot news! I've written a draft of something that I hope may be usable for this: #4572

pp-mo commented Oct 27, 2022

See also: #4994 -- since xarray already offers chunking control, though that does not provide variable-specific control as suggested in #4572.
I'm not saying that this is the way to solve all problems in future -- we should probably still add chunking control to Iris.
But it may be a useful stopgap.
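
The stopgap route would presumably look something like this (xarray's open_dataset chunking and DataArray.to_iris are real APIs; the filename, dimension and variable names are illustrative):

```python
import xarray as xr

# Control chunking at open time, then convert the result to an Iris cube.
ds = xr.open_dataset("dataset.nc", chunks={"model_level_number": 1})
cube = ds["air_temperature"].to_iris()
```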

@trexfeathers trexfeathers added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Jul 4, 2023
@SciTools SciTools locked and limited conversation to collaborators Jul 28, 2023
@trexfeathers trexfeathers converted this issue into discussion #5401 Jul 28, 2023
