Xarray bridge #4994

Closed
1 task done
pp-mo opened this issue Sep 27, 2022 · 19 comments
Labels: Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info Xarray Bridge

@pp-mo
Member

pp-mo commented Sep 27, 2022

Provide a bridge to convert data to+from Xarray

Enable use of Iris+Xarray together, with fast+efficient exchange between the two.

Primary Goals:

  • lossless, file-less exchange of information between Iris and Xarray
    • (i.e. cubes <--> xr.Dataset)
  • should be ~equivalent to file exchange : iris-save/xr-load, or the converse
  • but...
    • fast and easy !
    • no-copy and lazy-preserving transfer of (large) data arrays

Roughly following the ideas in #4835 (now a feature branch: https://github.com/SciTools/iris/tree/FEATURE_xarray_readwrite)
And especially to enable possible solutions listed in this comment

Also relevant (to lazy data) : ideas in pp-mo/iris#73
But NB while functional, this approach is very clunky : see generally simpler scheme suggested here

Tasks

  1. pp-mo
@pp-mo
Member Author

pp-mo commented Oct 21, 2022

#5024 is the beginnings of a new approach to this.

@dcherian
Contributor

> lossless, file-less exchange of information between Iris and Xarray

Is there something missing in Xarray's to_iris and from_iris methods? See https://github.com/pydata/xarray/blob/cdab326bab0cf86f96bcce4292f4fae24bddc7b6/xarray/convert.py

@pp-mo
Member Author

pp-mo commented Oct 27, 2022

> > lossless, file-less exchange of information between Iris and Xarray
>
> Is there something missing in Xarray's to_iris and from_iris methods? See https://github.com/pydata/xarray/blob/cdab326bab0cf86f96bcce4292f4fae24bddc7b6/xarray/convert.py

Not really as such, but there is quite a list of features that we can't get from that which we'd like.
As stated, we'd really like to be able to exchange data freely between the two for a "best of both" approach.

I haven't really looked into the differences between iris.save(cubes) and xr.convert.from_iris(cubes).to_netcdf().
But I have looked extensively into differences between iris.load('var') and xr.convert.to_iris(xr.open_dataset()['var']).
Apart from some corner cases and problems which could be addressed, there is a fairly long list of things which Iris recognises, which are lost in loading 'via' xarray in this way.
Here's a summary ...

  • global attributes
  • coordinate systems, attached via a 'grid_mapping' link
  • ancillary variables and cell measures, which behave very much like auxiliary coordinates, but are not (and should not be) treated as coordinates by Xarray
  • "some" auxiliary coordinates, not treated as coordinates by xarray -- notably, string data

We also think there are probably some tricky corners relating to masked/missing data and time-values handling (with more obscure non-standard calendars). But one can mostly work around those, at least, by turning off interpretation in xarray.

So, "tactically", the elephant in the room is that to_iris operates on a DataArray, not a Dataset, so the variables which are connected to the data variable by CF concepts are not available. Hence, no aux-coords, cell-measures, coordinate systems etc.
But also "strategically", as I already said, it's a fundamentally limited approach for Xarray to attempt to second-guess Iris's handling of CF concepts. That handling really belongs in Iris, since in that respect what Iris does can be seen as (perhaps should be ?!?) an extension to Xarray's scope, to add CF handling. (Which sounds rather like cf-xarray, but that's another story).
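
To illustrate the "tactical" point, here is a minimal pure-Python sketch (not Iris or Xarray code) of how CF attributes link one data variable to other variables in the same file. The dictionary layout, the variable names and the `cf_linked_variables` helper are all hypothetical, invented for this illustration; only the attribute names (`coordinates`, `grid_mapping`, `ancillary_variables`, `cell_measures`) come from the CF conventions.

```python
# Toy model of a netCDF/CF dataset: variable name -> attribute dict.
# This layout is an illustrative assumption, not a real Iris or Xarray structure.
dataset = {
    "air_temp": {
        "coordinates": "lat2d lon2d",
        "grid_mapping": "rotated_pole",
        "ancillary_variables": "air_temp_quality",
        "cell_measures": "area: cell_area",
    },
    "lat2d": {},
    "lon2d": {},
    "rotated_pole": {"grid_mapping_name": "rotated_latitude_longitude"},
    "air_temp_quality": {},
    "cell_area": {},
    "unrelated_var": {},
}

# CF attributes whose values name other variables in the same dataset.
CF_LINK_ATTRS = ("coordinates", "grid_mapping", "ancillary_variables", "cell_measures")


def cf_linked_variables(dataset, var_name):
    """Return names of the variables that `var_name` references via CF attributes."""
    linked = []
    attrs = dataset[var_name]
    for attr in CF_LINK_ATTRS:
        for token in attrs.get(attr, "").split():
            # 'cell_measures' holds "measure: varname" pairs: skip the "measure:" keys.
            if token.endswith(":"):
                continue
            if token in dataset and token not in linked:
                linked.append(token)
    return linked


linked = cf_linked_variables(dataset, "air_temp")
print(linked)
```

A conversion that is handed only `air_temp` cannot follow these links unless the linked variables travel with it, which is the argument for converting at the whole-dataset level.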

@cpelley

cpelley commented Oct 28, 2022

ping @TomekTrzeciak FYI

@DPeterK
Member

DPeterK commented Oct 28, 2022

@pp-mo I have to say - and following on from @dcherian's very good point - that I'd prefer any work to improve the Iris-XArray bridge to be done as enhancements to the functionality that already exists within XArray, rather than being added as new functionality within Iris. I think it would be confusing for users to have multiple APIs for this at all (e.g. xr.dataset.to_iris() and cube.to_xarray_dataset()) let alone also have them return differing results...

@dcherian to add to @pp-mo's very detailed answer above, the main thing is that the round-trip (that is Iris - XArray - Iris) results in a different end object to what was started with. Apologies - we've been meaning to contribute to XArray's Iris bridge with Iris-perspective updates so that the round-trip is lossless for years, because the functionality is super useful, but unfortunately other work pressures have always blocked us doing so 😑

@dcherian
Contributor

> Apologies - we've been meaning to contribute to XArray's Iris bridge with Iris-perspective updates so that the round-trip is lossless for years, because the functionality is super useful, but unfortunately other work pressures have always blocked us doing so

I fully understand :)

That said, given all the subtle details in @pp-mo's answer, perhaps this code would fit better, and be better maintained, in Iris than in Xarray. Note that pandas to xarray conversion code lives in xarray and pandas just calls that. So we could just have xarray.Dataset.to_iris call the iris method
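
A rough sketch of that delegation pattern, in plain Python: the method on one side is a thin wrapper that imports the other library only when called, and hands over the work. The module name `iris_bridge_sketch` and its function `cubes_from_dataset` are hypothetical placeholders, not real Iris or Xarray APIs, and a stub module stands in for the library that would own the conversion code.

```python
import importlib
import sys
import types


def to_iris(dataset, converter_module="iris_bridge_sketch"):
    """Delegate the conversion to code owned by another library.

    `iris_bridge_sketch` and `cubes_from_dataset` are hypothetical names,
    used only for this illustration.
    """
    try:
        # Import lazily, so the other library stays an optional dependency.
        bridge = importlib.import_module(converter_module)
    except ImportError as err:
        raise ImportError(
            f"to_iris requires the optional module {converter_module!r}"
        ) from err
    return bridge.cubes_from_dataset(dataset)


# Stand-in for the library that actually owns the conversion logic.
stub = types.ModuleType("iris_bridge_sketch")
stub.cubes_from_dataset = lambda ds: [f"cube({name})" for name in ds]
sys.modules["iris_bridge_sketch"] = stub

result = to_iris({"air_temp": None, "pressure": None})
print(result)  # ['cube(air_temp)', 'cube(pressure)']
```

With this shape there is a single implementation to maintain, and both entry points return the same result.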

@DPeterK
Member

DPeterK commented Nov 1, 2022

ping @jacobtomlinson, to whom I mentioned this work earlier

@pp-mo
Member Author

pp-mo commented Jan 5, 2023

Latest ideas on this, are for it to become a new independent repo "scitools/ncdata"
Xarray-Iris bridge proposal -- NcData.pdf

A key idea is that it exists in its own space, with no required dependency on either Xarray or Iris

@pp-mo
Member Author

pp-mo commented Jan 5, 2023

> Latest ideas on this, are for it to become a new independent repo "scitools/ncdata"

Current status : I have functional code, and we have pretty much decided now on a plan...

  • set up a separate scitools repo
  • generic python classes for handling netcdf-data
  • free+fast conversion between these and any of : netcdf file data / iris cubes / xarray datasets
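
As a rough idea of what "generic python classes for handling netcdf-data" could look like, here is a minimal sketch using only the standard library. The class names `SketchDataset` and `SketchVariable` and the `shallow_convert` function are hypothetical, made up for this illustration; the real code may be organised quite differently. The point shown is that a conversion can rebuild the lightweight containers while passing the data arrays through by reference, so nothing large is copied and a lazy array would stay lazy.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class SketchVariable:
    """Hypothetical stand-in for a netCDF variable."""
    dimensions: Tuple[str, ...]
    data: Any  # any array-like object, real or lazy; held by reference only
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a netCDF dataset or group."""
    dimensions: Dict[str, int] = field(default_factory=dict)
    variables: Dict[str, SketchVariable] = field(default_factory=dict)
    attributes: Dict[str, Any] = field(default_factory=dict)


def shallow_convert(source: SketchDataset) -> SketchDataset:
    """Build new containers that share the source's data arrays (no copy)."""
    return SketchDataset(
        dimensions=dict(source.dimensions),
        variables={
            name: SketchVariable(var.dimensions, var.data, dict(var.attributes))
            for name, var in source.variables.items()
        },
        attributes=dict(source.attributes),
    )


payload = list(range(6))  # stand-in for a large or lazy array
original = SketchDataset(
    dimensions={"x": 6},
    variables={"v": SketchVariable(("x",), payload, {"units": "K"})},
    attributes={"Conventions": "CF-1.8"},
)
converted = shallow_convert(original)
print(converted.variables["v"].data is payload)  # True: same array object
```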

Frankly though, this will take a while to establish properly : I have as yet no home repo, no tests written + plenty of other priorities getting in the way.

@pp-mo
Member Author

pp-mo commented Jan 9, 2023

> Latest ideas on this, are for it to become a new independent repo "scitools/ncdata" Xarray-Iris bridge proposal -- NcData.pdf
> A key idea is that it exists in its own space, with no required dependency on either Xarray or Iris

Key spoiler alert : I believe this will now work without requiring any changes to Xarray.

I know that @TomekTrzeciak was generally pessimistic about getting xarray to be fully transparent for round-trips.
So, when we add more extensive tests, we may well find that we do want some changes in Xarray.
But AFAICT so far the main needs are covered and this need not be a serious blocker : We don't necessarily require a completely bulletproof solution out of the box, and we can proceed without worrying too much about awkward corner cases.

@pp-mo
Member Author

pp-mo commented Jan 25, 2023

Update :
I just updated the PDF which I previously attached.
Xarray-Iris bridge proposal -- NcData.pdf
Because someone noticed that the links in the original did not work (thanks @lbdreyer !)

Please note that this contains a link to the private "Xarray WIP" PR, i.e. "NcData within Iris"
I have kept that as a branch + PR within my own fork, since I now believe we really don't want to implement it in Iris !

@larsbarring
Contributor

> since I now believe we really don't want to implement it in Iris

Is this intended to be an alternative way to the normal iris.load/iris.save way for loading/saving netCDF data, or do you imagine that in the future iris will build on this?

@pp-mo
Member Author

pp-mo commented Jan 25, 2023

> Is this intended to be an alternative way to the normal iris.load/iris.save way for loading/saving netCDF data, or do you imagine that in the future iris will build on this?

No, I think we're seeing this only as an alternate route where special needs apply.
Apart from letting us load from "problem files", and save "bad CF", the Xarray interoperability gives us a "plug and play" access to additional capabilities like Zarr/HDF support and (probably) better chunking control.

What happens in Iris is all much the same, since Iris treats it as netcdf load+save (not a new format).
So that means it should remain ~feature-compatible, but also will always impose an extra performance overhead.

@dcherian
Contributor

dcherian commented Jan 25, 2023

@pp-mo If you (or anyone else too!) are interested, we have our regular Bi-weekly Xarray meeting next week on Feb 1, 2023. I think your plan could make for very interesting discussion.

@pp-mo
Member Author

pp-mo commented Jan 30, 2023

> @pp-mo If you (or anyone else too!) are interested, we have our regular Bi-weekly Xarray meeting next week on Feb 1, 2023. I think your plan could make for very interesting discussion.

Thanks for the suggestion - I imagine that might be useful.
I haven't attended these previously, do I need to prepare anything ?

@dcherian
Contributor

dcherian commented Jan 30, 2023

> do I need to prepare anything

I think just a brief overview of your plans and any feedback you'd like. You could talk through it, the meeting is meant for quick in-person iteration, and is quite informal.

(cc @pydata/xarray)

PS: Unfortunately, I'll be out of town.

@pp-mo
Member Author

pp-mo commented Jul 3, 2023

Latest approaching Iris 3.7

@pp-mo
Member Author

pp-mo commented Jan 5, 2024

Status update 2024-01-05

My latest solutions are now here : https://github.com/pp-mo/ncdata

I've just added packaging and made it available on PyPI; conda-forge is pending.
The current 0.0.1 / 0.0.2 builds have some bugs in the docs, and fail to keep data lazy on passing to Iris
-- but those key problems are now fixed on main.
Will fix a couple more things and cut a v0.1, when we get the conda channel presence.

@TomekTrzeciak

> My latest solutions are now here : https://github.com/pp-mo/ncdata

@pp-mo, sweet to see this progressing and taking shape and also your ideas from ncobj being taken forward 👍.
