Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethinking fix_file #2129

Open
schlunma opened this issue Jul 11, 2023 · 6 comments
Open

Rethinking fix_file #2129

schlunma opened this issue Jul 11, 2023 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@schlunma
Copy link
Contributor

Currently, fx_file has the call signature (filename) -> (filename). Thus, files need to be copied if they need to be modified (overwriting input files is a very bad idea), which is very expensive. Usually, the netCDF4 library is used to perform these kind of modifications.

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris() which allows converting a DataArray into a Cube (no idea how efficient this is, though). In addition, the iris devs are working on an improved xarray-iris bridge.

Thus, it would be very nice to allow an efficient usage of xarray in fix_file. My question is now: how would that look like in practice? Possible solutions I could think of are:

  • Allow fx_file to return xarray objects (currently only DataArray?) and load to read these kind of objects.
  • Allow fx_file to return cubes and load to read cubes.
  • Rewrite fx_file in such a way that it is a callback in load (with one of the call signatures above).

The reason I am asking this now is that I want to get started with reading the ERA5 GRIB files that are available on Levante (see #1991). However, iris-grib cannot open them. On the other hand, xarray (in combination with ECMWF's cfgrib) works perfectly fine. Thus, an option to include xarray into our pipeline would be super helpful for me.

@ESMValGroup/technical-lead-development-team opinions?

@bouweandela
Copy link
Member

bouweandela commented Aug 1, 2023

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris()which allows converting a DataArray into a Cube (no idea how efficient this is, though).

The issue with the current implementation is of xarray.DataArray.to_iris is that it loses information, if I remember correctly some of the coordinates got lost last time I tried to use it.

@bouweandela
Copy link
Member

bouweandela commented Aug 1, 2023

Rewrite fx_file in such a way that it is a callback in load (with one of the call signatures above).

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.

@bouweandela
Copy link
Member

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?

@valeriupredoi
Copy link
Contributor

AFAIK iris can't load file objects eg something stored on an S3 storage so our main problem is not fix_file but the I/O itself - if we get the I/O to work with object stores then we can modify fix_file to work on such a thing I reckon

@valeriupredoi
Copy link
Contributor

I have experience with netCDF files on various types of storages, including object stores, so does @zklaus (am fairly sure of that) - iris is the type of thing you don't want in that case, you want a basic loader into a file format like Zarr or overloaded Zarr - beware of xarray since that transfers the data!

@schlunma
Copy link
Contributor Author

schlunma commented Aug 1, 2023

The issue with the current implementation is of xarray.DataArray.to_iris is that it loses information, if I remember correctly some of the coordinates got lost last time I tried to use it.

I tested this a lot lately for the ERA5 grib files, in which case I didn't find any problems with this function. The lazyness of the data and all relevant metadata is properly preserved.

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.

Conceptually I agree, but our current callback argument (which is simply passed to iris.load_raw is not suited for that. It would rather be something like a load_func option that defaults to iris.load_raw. However, to ensure backwards-compatibility, we need to allow fix_file to return paths...this will not really make it cleaner.

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

I don't think we should get rid of fix_file entirely (we cannot expected Iris to handle every single format natively). fix_file already returns what it gets if no special fix is needed. I added some type hints for that in #2160, but these can be changed to Any.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?

In #2160, I allowed load to handle CubeList objects (they are simply returned), this is more flexible than relying on xarray's to_iris in my opinion (we are not sure if this works in every case), and the magic xarray -> iris can happen in fix_file. Support for other formats can be implemented to load in subsequent PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants