
More control over lazy data creation (chunking) #3333

Closed · TomekTrzeciak opened this issue Jun 18, 2019 · 16 comments
Labels: Dragon 🐉 (https://github.com/orgs/SciTools/projects/19?pane=info), Feature: ESMValTool, Status: Decision Required, Type: Performance

Comments


Currently, it is not possible to control how lazy data gets chunked, nor to change the chunking afterwards (dask's rechunk function does not change the original chunking; it only adds extra split/merge operations on top of it). While the default choice of chunking might be OK in some cases, in others it may be unsuitable, so it would be useful to allow for user choice in this respect.
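
A minimal illustration of the rechunk point, in plain dask (the array shapes are arbitrary):

```python
import dask.array as da

a = da.ones((1000, 1000), chunks=(100, 1000))
b = a.rechunk((1000, 1000))

# rechunk() does not change how the data is first produced/read: it layers
# extra split/merge tasks on top of the original (100, 1000)-chunked graph.
print(len(dict(a.__dask_graph__())))  # 10 tasks - one per original chunk
print(len(dict(b.__dask_graph__())))  # more tasks: the original 10 plus the merge
```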

pp-mo commented Jun 20, 2019

Hi @TomekTrzeciak, thanks for the suggestion.

In what operations do you want to control chunking?

You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.
We also have chunking control when saving to netCDF: iris.fileformats.netcdf.save supports a 'chunksize' keyword, so that can also appear as a kwarg in an iris.save call.
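
For reference, the constructor route looks like this (a minimal sketch; the shape and names are illustrative):

```python
import dask.array as da
from iris.cube import Cube

# Build a cube directly from a pre-chunked lazy array.
lazy = da.zeros((240, 1500, 2000), chunks=(1, 1500, 2000))
cube = Cube(lazy, var_name="example")
print(cube.has_lazy_data())     # True
print(cube.lazy_data().chunks)  # ((1, 1, ..., 1), (1500,), (2000,))
```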

TomekTrzeciak commented:

> In what operations do you want to control chunking?

iris.load & co.

> You can already pass a pre-created dask array to the cube constructor, or assign it into cube.data.

Constructing the cube directly is not very convenient. I guess reassigning cube.data could be an option, but it is still rather awkward to write something like this:

```python
cube = iris.load_cube(filepath)
cube.data = dask.array.from_array(netCDF4.Dataset(filepath)[cube.var_name], chunks=-1)
```

instead of just:

```python
cube = iris.load_cube(filepath, chunks=-1)
```

pp-mo commented Jun 21, 2019

Regarding load, options will depend on the source format.

The "field-based" file formats (FF, PP, GRIB) deal only in 2D fields, and they don't have any efficient access to subregions of a 2d field (i.e. the format code can only load a whole field then extract from it). Thus, the natural chunksize is the whole field, and I don't think there will ever be any practical use for chunking differently in those cases.

But I guess you are talking about netCDF?
What we currently have there is basically determined by this code and this code.
In summary:

  • take the whole variable as one chunk,
  • or use the file's own chunking, if set,
  • and reduce the earlier dimensions if the resulting chunks are impracticably large.

There is certainly scope for controlling that: for instance, the chunk reduction assumes C-order contiguity, so it will behave worst-case if earlier dimensions vary faster in the file.
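
For illustration, here is a rough sketch of the kind of reduction policy described above -- not Iris's actual implementation; the byte limit and halving strategy are invented for the example:

```python
import numpy as np

def reduce_chunks(shape, dtype, limit_bytes=100 * 1024 ** 2):
    # Start from the whole variable as one chunk, then shrink the earliest
    # dimensions first (assuming C-order: earlier dimensions vary slowest)
    # until a single chunk fits within the byte limit.
    chunks = list(shape)
    itemsize = np.dtype(dtype).itemsize
    for dim in range(len(chunks)):
        while np.prod(chunks) * itemsize > limit_bytes and chunks[dim] > 1:
            chunks[dim] = -(-chunks[dim] // 2)  # halve, rounding up
        if np.prod(chunks) * itemsize <= limit_bytes:
            break
    return tuple(chunks)

# e.g. reduce_chunks((20000, 74, 1500, 2000), "f8") shrinks the leading
# dimensions first, leaving the trailing (fastest-varying) dimensions whole.
```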

So I think we are talking about adding a chunk-control keyword to the netcdf loader.
That could be pretty simple, as the underlying interfaces already have suitable controls in them.
Is that what you mean?

TomekTrzeciak commented:

> So I think we are talking about adding a chunk-control keyword to the netcdf loader.
> That could be pretty simple, as the underlying interfaces already have suitable controls in them.
> Is that what you mean?

Yes, I think an extra keyword passed through from the load API to the netCDF loader would be all that's needed.

pp-mo commented Jun 21, 2019

> an extra keyword passed through from the load API to the netCDF loader

I had a quick look. Unfortunately, there is no support for additional args/kwargs in the generic load functions, iris.load/load_cube/load_cubes/load_raw.
I guess that's because it's hard to associate controls with individual load items, unlike save where it makes sense to provide additional controls on a per-cube basis.
So I'm afraid this may fall short of your expectations: I think it will be much harder to extend the iris.load API.

> adding a chunk-control keyword to the netcdf loader

In the netcdf-specific loader, we currently have iris.fileformats.netcdf.load_cubes(filenames, callback=None).
I'd expect to add a new facility there.
How about something like:

```python
iris.fileformats.netcdf.load_cubes(filenames, callback=None, filepath_varname_chunks={})
```
Then use it like:

```python
from iris.fileformats.netcdf import load_cubes as nc_load_cubes

cubes = nc_load_cubes(filename, filepath_varname_chunks={(filename, 'var1'): (32, -1, 16)})
```

Will this work for your purposes?
And generally, what does your use case look like -- we'll be needing a test case anyway ...
Feel free to propose something.

TomekTrzeciak commented:

@pp-mo, exposing chunks in iris.fileformats.netcdf.load_cubes could be one way, but this API is quite low-level (it returns a cube generator, so one needs to be careful how it gets passed around in user code; it is probably best to convert it to something like a CubeList immediately).

I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does - the backends that don't support certain kwargs could simply ignore them. But I can see how this leads to passing arguments through several levels of APIs, so arguably not the nicest pattern.

An alternative could be to use a context manager to set/pass backend options without bloating the top-level APIs. I've noticed that there is already an iris.config.netcdf context manager, so that could be a possible place to add this (or another one could be added specifically for dask if that's preferable). Then the user code could look like this:

```python
with iris.config.netcdf.context(chunks=-1):
    cube = iris.load_cube('dataset.nc')
```
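
For what it's worth, a minimal sketch of how such an option-holding context manager could work (hypothetical: this is not the existing iris.config.netcdf implementation, and the chunks option is invented for illustration):

```python
import threading
from contextlib import contextmanager

class NetcdfOptions(threading.local):
    """Hypothetical per-thread loader options, modelled loosely on
    iris.config.netcdf; the 'chunks' option does not exist in Iris."""

    def __init__(self):
        self.chunks = None  # None => keep the default chunking policy

    @contextmanager
    def context(self, **kwargs):
        # Temporarily override options, restoring the old values on exit.
        saved = self.__dict__.copy()
        self.__dict__.update(kwargs)
        try:
            yield
        finally:
            self.__dict__.clear()
            self.__dict__.update(saved)

netcdf = NetcdfOptions()

# The loader would consult netcdf.chunks inside this block.
with netcdf.context(chunks=-1):
    pass
```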

pp-mo commented Jun 25, 2019

Hi @TomekTrzeciak, thanks for sticking with this.

> I think iris.load/load_cube/load_cubes/load_raw could support additional kwargs the same way that iris.save does

I don't think there is any serious reason to oppose additional load controls. I just thought it sounded like more trouble to get such a change agreed.
I myself certainly don't object in principle to passing keywords down from the main 'load' functions. I think that is probably the most obvious way to do it.

My concern is that, to be useful, I think we need to be able to specify chunking of individual file variables (see why below...). This means that the controls can't be expressed in terms of core Iris concepts such as cube identity, which then looks rather different to the 'save' case.
As I said...

> it's hard to associate controls with individual load items, unlike save where it makes sense

I don't see any sensible way around this, as you can't easily predict what a given load will produce, or which Iris objects relate to which parts of a source file -- because Iris itself doesn't make any simple guarantees about those behaviours: if data changes, you can't reliably know beforehand how it will merge, how many cubes are returned, or in what order -- see for example #3314.

The reason I think we need a flexible control solution is that we do need to cope with large AuxCoords -- often larger than the data variables. That is exactly why we implemented lazy loading for AuxCoords. So it means we will want to control chunking of those variables too. In the near future, we also expect to be dealing with large unstructured grids, which will present the same problem.

I think it could be fine, if we can design a default behaviour that keeps the simple cases simple. I'm just a bit wary, as it isn't immediately obvious to me how that can work.

pp-mo commented Jul 16, 2019

Hints of progress? See #3357 (comment).

pp-mo commented Jul 19, 2019

Cross-copied from #3357.
I'm now trying to retrospectively separate these two discussions!

@TomekTrzeciak Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size.

@pp-mo A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable: in this case we need dimension names, which don't actually appear anywhere in the Iris/CF data representation.
If we could use names of dimension coordinates, that would help loads with this. But I think we can only do that if we first "pre-load", then "re-load" the file. 😢
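
For concreteness, the xarray pattern being referred to (this is real xarray API; the filename and dimension name are illustrative):

```python
import xarray as xr

# Chunk sizes are keyed by *dimension name*, which is a netCDF file-level
# concept rather than an Iris/CF one.
ds = xr.open_dataset("dataset.nc", chunks={"time": 10})
```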

Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case: in principle you could even have an X(r:74, t:20000, y:1500, x:2000) and a Y(x:2000, t:20000, r:74), but I think it would be really rare -- maybe a use case in ocean data?

pp-mo commented Jul 19, 2019

An update in understanding (mine, anyway) ...

> occurrence and order of dimensions will differ between variables

This is true, but I think quite rare, as stated.
However, where I said above ...

> the chunk reduction assumes C-order contiguity, so it will behave worst-case if earlier dimensions vary faster in the file.

Though that may be true for the abstract 'as_lazy_data' call, I now think that is probably not so for netcdf data.
In at least the overwhelming majority of files (all the files I've seen), you can assume C-ordering of dimension variables, and that really ought to inform chunking policy. However, the dask auto-chunking policy is agnostic about dimension order, which at present seems to me a good reason for not relying on it entirely (i.e. for retaining our own policy).

Frustratingly, I can't find a clear statement of this anywhere.
I think it is not guaranteed for HDF5 data generally, as that has a very general access model.
It might well be "a given" for data created via the netCDF interface, but I can't find any clear statement of that.

rockdoc commented Jul 30, 2019

Hi @pp-mo. One possible alternative solution -- or a partial solution -- would be to support user specification of a chunking hint. There would be a default setting, of course: the canonical one (whatever that might be). I adopted this approach in a Python utility I developed for writing multiple compressed variables to netCDF files with different chunking strategies.

Looking at the code, I can see that my utility supported the following chunking hints:

```python
# Chunk size adjustment hints
HINT_INTEGER_MULTIPLE_OF_ONE_CHUNK = 1
HINT_DECREMENT_BY_ONE_FROM_LEFT    = 2
HINT_INCREMENT_BY_ONE_FROM_RIGHT   = 3
HINT_SET_TO_ONE_FROM_LEFT          = 4
HINT_SET_TO_MAX_FROM_RIGHT         = 5
```

Without digging around in the low-level code, I can't remember off the top of my head what each of these hints led to in terms of chunking policy. But that doesn't matter here; you'd obviously choose a variety of hints suitable for chunking dask arrays in different ways.
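
Purely as a guess at what two of these hints could mean in practice (the thread does not record their real semantics, so the following is speculative; the HINT_ constants come from the listing above):

```python
from math import prod

def apply_hint(chunks, hint, limit_elems):
    # Speculative: adjust a chunk shape according to a named hint until it
    # fits within a target number of elements.
    chunks = list(chunks)
    if hint == HINT_SET_TO_ONE_FROM_LEFT:
        # Collapse leading dimensions to 1 until the chunk fits the limit.
        for i in range(len(chunks)):
            if prod(chunks) <= limit_elems:
                break
            chunks[i] = 1
    elif hint == HINT_DECREMENT_BY_ONE_FROM_LEFT:
        # Shave the leftmost shrinkable dimension one step at a time.
        i = 0
        while prod(chunks) > limit_elems and any(c > 1 for c in chunks):
            if chunks[i] > 1:
                chunks[i] -= 1
            else:
                i += 1
    return tuple(chunks)
```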

Your solution might want to fall back on a default chunking hint/policy for those cases when the user (or calling program) doesn't specify chunk sizes explicitly. It sounds like Iris is already implementing a default policy, even if it is just the dask default.

Anyways, just thought I'd throw this into the mix, although it might not be suitable for the current use-case.

(PS: If you did want to snoop around the code, I can sort that out - it's in a private MetO repo.)

pp-mo commented Aug 21, 2019

Interested in this too, @cpelley?

pp-mo commented Feb 7, 2022

Following some more recent experiences, I'm changing my mind on this.
Although it can't be done "in Iris terms", I'm now convinced by a few examples I've seen that only user-controlled chunking can usefully fix some of the performance limitations.

My key motivating example:

  • a loaded cube merges multiple input files: each file contains a 3D datacube, dimensions (nz, ny, nx), for a single timestep; the result is a time sequence --> dimensions (nt, nz, ny, nx)
  • the user extracted a single vertical level --> cube[:, i_lev] --> (nt, ny, nx)
  • when the result is fetched, this attempts to fetch the whole data and only then extract the single vertical level, which exceeds machine memory
  • when saved to a result file, this takes about 20× longer than equivalent data of the same size

So I now believe we really do need to enable user chunking control in such cases, but for the reasons above it will need to be expressed in file-specific terms. Also for the reasons above, I think it makes sense, and will be needed, to apply chunking controls (see the sketch after this list):

  • (a) to specific file variables, and
  • (b) to set chunk sizes for specific file dimensions
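
A hypothetical sketch of what such a control could look like (all names here are invented for illustration; this is not an existing Iris API):

```python
from contextlib import contextmanager

# Invented for illustration: a registry of per-file-variable,
# per-file-dimension chunk sizes that a loader could consult.
_CHUNK_SETTINGS = {}

@contextmanager
def chunk_control(var_name, **dim_chunks):
    # e.g. chunk_control("air_temperature", time=1, model_level_number=1)
    _CHUNK_SETTINGS[var_name] = dict(dim_chunks)
    try:
        yield
    finally:
        _CHUNK_SETTINGS.pop(var_name, None)

def chunks_for(var_name, dim_names, shape):
    # Explicit settings win; any unmentioned dimension keeps its full length.
    settings = _CHUNK_SETTINGS.get(var_name, {})
    return tuple(settings.get(d, n) for d, n in zip(dim_names, shape))
```

With something like this, a loader could chunk the (nt, nz, ny, nx) data above one vertical level at a time, so that cube[:, i_lev] never has to fetch the whole array.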

pp-mo commented Feb 7, 2022

> key motivating example:

I believe that #4448 is also a very similar problem, with possibly a similar solution.

pp-mo commented Feb 9, 2022

Hot news! I've written a draft of something that I hope may be usable for this: #4572

pp-mo commented Oct 27, 2022

See also: #4994 -- since xarray already offers chunking control, though that does not provide variable-specific control as suggested in #4572.
I'm not saying that this is the way to solve all problems in future -- we should probably still add chunking control to Iris.
But it may be a useful stopgap.
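
The stopgap route would presumably look something like this (xarray's open_dataset chunking and DataArray.to_iris are real APIs; the filename, dimension and variable names are illustrative):

```python
import xarray as xr

# Control chunking at open time, then convert the result to an Iris cube.
ds = xr.open_dataset("dataset.nc", chunks={"model_level_number": 1})
cube = ds["air_temperature"].to_iris()
```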

@trexfeathers trexfeathers added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Jul 4, 2023
@SciTools SciTools locked and limited conversation to collaborators Jul 28, 2023
@trexfeathers trexfeathers converted this issue into discussion #5401 Jul 28, 2023
