Discussion of a "cf-xarray" package #771

Closed
jthielen opened this issue Apr 22, 2020 · 17 comments

@jthielen

jthielen commented Apr 22, 2020

Below is a brief outline about a potential cf-xarray package, a library for parsing CF metadata in xarray objects to provide convenience methods on accessors for common operations. Hopefully this can start a discussion here about such a package after some initial comments here: JiaweiZhuang/xESMF#74

These are all just my own interpretations of things at the moment based on my experiences with similar issues in MetPy, so please offer suggestions for modifications and improvements!


Breaking it down by top-level items of the current version (1.8) of the CF Conventions document:

NetCDF Files and Components

Relevant details are mostly covered by xarray itself (except for perhaps groups, which have open issues on xarray, e.g., pydata/xarray#2916).

Description of the Data

Units

While the units attribute allows basic tracking of the units within xarray itself, work towards more substantial unit support has been a longstanding effort in the community (pydata/xarray#525). Efforts have been converging recently around Pint integration with xarray, including the upcoming pint-xarray package to include unit-related functionality on accessors. Some (non-negligible) work will need to be done to ensure complete CF/UDUNITS compatibility, but this is something we have been and will continue to be working on in MetPy (Unidata/MetPy#1362).

Options to consider here:

  • Leave units out of cf-xarray, and just refer to pint-xarray in examples
  • Integrate pint-xarray into cf-xarray, relying on the modified Pint registry from MetPy being used here upstream for better CF/UDUNITS compatibility

cf-units is another package to keep in mind here, which has sort of an inverse problem to Pint: it is already fully CF/UDUNITS compliant, but doesn't have a corresponding duck array type or set of NumPy functions so that it could integrate with xarray (at least without attribute operation hooks).

Standard Name

A simple API for getting variables by their standard name (wrapping filter_by_attrs) would be useful. Perhaps

  • ds.cf[standard_name] or
  • ds.cf.search_standard_name(standard_name)

Doing detailed parsing of constructed standard names and automatically applying appropriate operations to calculate them may be a cool feature, but I definitely wouldn't consider it a priority until there is a demonstrated need.
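As a rough illustration of the lookup described above, here is a minimal sketch that works on plain attribute dictionaries (the kind of mapping found in each variable's .attrs) rather than on a real Dataset. The function name search_standard_name mirrors the second option above, but its signature and the example variables are assumptions made for illustration only. On an actual Dataset, ds.filter_by_attrs(standard_name=...) performs the equivalent selection.

```python
from typing import Any, List, Mapping


def search_standard_name(
    variables: Mapping[str, Mapping[str, Any]], standard_name: str
) -> List[str]:
    """Return the names of variables whose standard_name attribute matches."""
    return [
        name
        for name, attrs in variables.items()
        if attrs.get("standard_name") == standard_name
    ]


# Hypothetical dataset: variable name -> attribute dictionary.
example_vars = {
    "T2": {"standard_name": "air_temperature", "units": "K"},
    "TSK": {"standard_name": "surface_temperature", "units": "K"},
    "tas": {"standard_name": "air_temperature", "units": "degC"},
    "mask": {},  # no CF metadata at all
}

matches = search_standard_name(example_vars, "air_temperature")
```

Returning a list rather than a single variable keeps the helper usable when several variables share one standard name, which CF permits.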

Ancillary Data

Another useful place for a simple helper. Something like ds.cf.ancillary_variables(varname) could return an iterable of linked ancillary variables in the dataset.
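A possible shape for such a helper, sketched on plain attribute dictionaries instead of a real Dataset: CF stores the links in an ancillary_variables attribute holding a blank-separated list of variable names. The function signature, the choice to silently skip names that are missing from the dataset, and the example variables are all assumptions for illustration.

```python
from typing import Any, List, Mapping


def ancillary_variables(
    variables: Mapping[str, Mapping[str, Any]], varname: str
) -> List[str]:
    """List the ancillary variables linked from ``varname`` that exist."""
    linked = str(variables[varname].get("ancillary_variables", ""))
    # The attribute is a blank-separated list of variable names.
    return [name for name in linked.split() if name in variables]


# Hypothetical dataset: variable name -> attribute dictionary.
example_vars = {
    "sst": {
        "standard_name": "sea_surface_temperature",
        "ancillary_variables": "sst_error sst_quality not_in_file",
    },
    "sst_error": {"standard_name": "sea_surface_temperature standard_error"},
    "sst_quality": {"standard_name": "sea_surface_temperature status_flag"},
}

linked_names = ancillary_variables(example_vars, "sst")
```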

Other Subsections

Long Name and Flags are other subsections here, but I'm not sure if there is anything useful for cf-xarray to do here.

Coordinate Types

This is one of the major motivating factors for a common cf-xarray package as brought up in JiaweiZhuang/xESMF#74. It has also been one of the core components of MetPy's accessor. There is definitely a broader need for these features, and at least one non-meteorology-specific package (xrviz) has an optional dependency on MetPy in order to take advantage of them.

The crux of MetPy's coordinate type identification comes in its check_axis function. This works by scanning the variable attributes for

  • standard_name, an optional CF criterion for all four types
  • _CoordinateAxisType, to shortcut identification if already identified by a THREDDS server
  • axis, another optional CF criterion for direct identification of each type
  • positive, a CF requirement when a non-pressure vertical coordinate is present
  • units, which has particular requirements for longitude, latitude, vertical pressure, and certain time coordinates

If all those fail, MetPy also falls back to some conservative regex matching of variable names (but this is something I would not expect to see carried over to cf-xarray).
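To make the attribute scan concrete, here is a simplified sketch in the spirit of that approach. It is not MetPy's actual check_axis code: the function name guess_coordinate_type, the lookup tables, and the pressure unit list are illustrative assumptions and deliberately incomplete, and the regex fallback on variable names is omitted. It returns MetPy-style labels and checks the attributes in the order listed above.

```python
from typing import Any, Mapping, Optional

# Illustrative, incomplete lookup tables (assumptions, not MetPy's own).
STANDARD_NAME_TYPES = {
    "projection_x_coordinate": "x",
    "projection_y_coordinate": "y",
    "longitude": "longitude",
    "latitude": "latitude",
    "air_pressure": "vertical",
    "height": "vertical",
    "altitude": "vertical",
    "time": "time",
}
THREDDS_TYPES = {
    "GeoX": "x", "GeoY": "y", "Lon": "longitude", "Lat": "latitude",
    "GeoZ": "vertical", "Height": "vertical", "Pressure": "vertical",
    "Time": "time",
}
AXIS_TYPES = {"X": "x", "Y": "y", "Z": "vertical", "T": "time"}
LONGITUDE_UNITS = {"degrees_east", "degree_east", "degree_E",
                   "degrees_E", "degreeE", "degreesE"}
LATITUDE_UNITS = {"degrees_north", "degree_north", "degree_N",
                  "degrees_N", "degreeN", "degreesN"}
PRESSURE_UNITS = {"Pa", "hPa", "mbar", "millibar", "bar", "atm"}


def guess_coordinate_type(attrs: Mapping[str, Any]) -> Optional[str]:
    """Guess a coordinate type from one variable's attribute dictionary."""
    # 1-3: direct identification via standard_name, THREDDS hint, or axis.
    for table, key in (
        (STANDARD_NAME_TYPES, "standard_name"),
        (THREDDS_TYPES, "_CoordinateAxisType"),
        (AXIS_TYPES, "axis"),
    ):
        found = table.get(attrs.get(key))
        if found is not None:
            return found
    # 4: a "positive" attribute marks a vertical coordinate.
    if attrs.get("positive") in ("up", "down"):
        return "vertical"
    # 5: units with special meaning for particular coordinate types.
    units = str(attrs.get("units", ""))
    if units in LONGITUDE_UNITS:
        return "longitude"
    if units in LATITUDE_UNITS:
        return "latitude"
    if units in PRESSURE_UNITS:
        return "vertical"
    if " since " in units:  # e.g. "hours since 2020-01-01"
        return "time"
    return None
```

For example, guess_coordinate_type({"units": "degrees_east"}) yields "longitude", while an empty attribute dictionary yields None so that a caller can decide on its own fallback.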

A discussion of the API here is definitely in order, as it will likely be one of the central components (not just for pulling out coordinates of a particular type like da.cf.X, but also convenience wrappers like da.cf.sum(axis="X") that handle automatic coordinate type recognition for much of xarray's API). The canonical labels from CF are X, Y, Z, and T. MetPy's implementation has diverged from this in favor of x, longitude, y, latitude, vertical, and time for a few reasons:

  • We've had issues with treating both projection x coordinates and longitude under the "x" label, particularly with the need to access both for different purposes at the same time (xref Unidata/MetPy#1090, "Change xarray coordinate identification to enforce dimensionality and allow both lat/lon and x/y identification")
  • The most common vertical coordinate we work with is pressure, so calling it "z" or "Z", when that's the direct identifier of the second-most commonly used vertical coordinate (height/geopotential height), would have caused unnecessary confusion. In early discussion we also settled on "vertical" over the alternatives "level" and "levels".
  • Given "vertical" became a longer label, it felt natural to also just go with "time" instead of "t" or "T"

Any preferences here on direct CF labels, MetPy-style labels, or some other solution?

xref geoxarray/geoxarray#10

Other Components

cftime is another package worth mentioning here that falls under this section of Coordinate Types

Would it be safe to leave out any special handling/parsing of parametric vertical coordinates, one of the other topics mentioned in the CF conventions under this section?

Coordinate Systems

Some of the earlier subsections of the CF conventions here (e.g., Independent Latitude, Longitude, Vertical, and Time Axes; Two-Dimensional Latitude, Longitude, Coordinate Variables) are addressed more in the coordinate identification above, but what is particularly worth noting here is the subsection on Horizontal Coordinate Reference Systems, Grid Mappings, and Projections.

This has been discussed at length elsewhere (particularly pydata/xarray#2288 and #356), so for now I'll just defer to those discussions for details. Also, here is @djhoese's relevant comment from the preceding discussion on JiaweiZhuang/xESMF#74:

Another person I think should be kept in the loop in this discussion is @snowman2 who maintains pyproj and rioxarray. I bring these projects up because:

  1. pyproj has the ability to convert CRS information to/from CF definitions. I recently switched some of my projects to depending on it for this.
  2. rioxarray is a good example of using xarray accessors to deal with various headaches that this project may run into. For example, defining which dimensions represent which geographic dimension ("x" and "y" versus "lon" and "lat" or any other odd naming that may exist in the wild).

The combination of these two projects has resulted in something that I've thought is much better than what I was attempting in geoxarray. @snowman2 creates a "spatial_ref" coordinate variable which itself has a crs_wkt attribute (the WKT version of the Coordinate Reference System). I think you can then also copy other coordinate variables like x/y to this spatial_ref variable. This has the benefit of holding on to CRS information and making it easy to access where xarray may have dropped .attrs or coordinate variables in other implementations.

Edit: I should have mentioned that I'd like geoxarray to become something similar to rioxarray but not rasterio specific (no rasterio/gdal dependency).

In short, no matter how the details work out in the background, I'd imagine an API here of something like da.cf.crs, to get some kind of standard CRS object, which can then be converted as needed for data transformations, georeferenced calculations, and plotting.
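One small piece of that could look like the sketch below, which only follows the CF grid_mapping attribute from a data variable to the variable carrying the projection parameters. It works on plain attribute dictionaries, handles only the simple single-name form of grid_mapping, and uses a hypothetical function name and example values. The returned dictionary is the kind of CF description that a library such as pyproj (via CRS.from_cf, as far as I know) could turn into a CRS object.

```python
from typing import Any, Dict, Mapping, Optional


def find_grid_mapping(
    variables: Mapping[str, Mapping[str, Any]], varname: str
) -> Optional[Dict[str, Any]]:
    """Return the attributes of the grid mapping variable for ``varname``.

    Assumes the simple form of the attribute (a single variable name);
    the extended "name: coord coord" form is not handled here.
    """
    mapping_name = variables[varname].get("grid_mapping")
    if mapping_name is None or mapping_name not in variables:
        return None
    return dict(variables[mapping_name])


# Hypothetical dataset: variable name -> attribute dictionary.
example_vars = {
    "temperature": {"units": "K", "grid_mapping": "lambert"},
    "lambert": {
        "grid_mapping_name": "lambert_conformal_conic",
        "standard_parallel": 25.0,
        "longitude_of_central_meridian": 265.0,
        "latitude_of_projection_origin": 25.0,
    },
    "mask": {},
}

crs_attrs = find_grid_mapping(example_vars, "temperature")
```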

Labels and Alternative Coordinates

I don't think there is anything for cf-xarray to do here?

Data Representative of Cells

Another big need of cf-xarray which has been brought up in a lot of discussion in the past (#356). I'm less well-versed in this area, so I'd want to defer to others on the best APIs for getting appropriate cell bounds from coordinates. One other question I wanted to raise: is there anything that cf-xarray should do with respect to climatological statistics, which also falls under this section?
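As a starting point for that discussion, here is a hedged sketch of two building blocks. The first reads the CF bounds attribute, which names the boundary variable for a coordinate. The second is a common midpoint heuristic for inferring bounds from 1D cell centers when no boundary variable exists; it is an assumption on my part, not something CF prescribes, and it is only reasonable for monotonic coordinates. Both function names are hypothetical.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple


def bounds_name(attrs: Mapping[str, Any]) -> Optional[str]:
    """Return the name of the boundary variable, if the coordinate has one."""
    return attrs.get("bounds")


def infer_bounds(centers: Sequence[float]) -> List[Tuple[float, float]]:
    """Infer (lower, upper) cell bounds from 1D cell centers.

    Interior edges are midpoints between neighbouring centers; the two
    outer edges are extrapolated by mirroring the nearest interior edge.
    """
    if len(centers) < 2:
        raise ValueError("need at least two centers to infer bounds")
    mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    first = centers[0] - (mids[0] - centers[0])
    last = centers[-1] + (centers[-1] - mids[-1])
    edges = [first] + mids + [last]
    return list(zip(edges[:-1], edges[1:]))


lat_bounds = infer_bounds([0.0, 1.0, 2.0])
```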

Reduction of Dataset Size

Would it be within scope to include helpers for uncompressing gathered data using MultiIndexes and sparse arrays?
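For context, the core of such a helper is small. In CF compression by gathering, the list variable stores indices into the flattened (row-major) product of the dimensions named in its compress attribute. The sketch below converts those flat indices back into per-dimension index tuples, which is the information a MultiIndex or a sparse coordinate list would need. The function name and the example sizes are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def unravel_gathered(
    indices: Sequence[int], shape: Sequence[int]
) -> List[Tuple[int, ...]]:
    """Convert flat row-major indices into per-dimension index tuples."""
    result = []
    for flat in indices:
        multi = []
        # Peel off the fastest-varying (last) dimension first.
        for size in reversed(shape):
            flat, remainder = divmod(flat, size)
            multi.append(remainder)
        result.append(tuple(reversed(multi)))
    return result


# Hypothetical list variable with compress = "lat lon", lat = 2, lon = 3.
land_points = unravel_gathered([0, 4, 5], shape=(2, 3))
```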

Discrete Sampling Geometries

Would any special handling of DSG be within scope here (such as utilities for Pandas/GeoPandas conversion like Unidata/MetPy#1074)?


I think that's all, so again, please offer input/feedback/suggestions/improvements! I'm tagging several people that I saw spoke up on prior related issues, but please feel free to loop in anyone else I missed or who would be able to offer input.

cc @dcherian, @djhoese, @rabernat, @snowman2, @huard, @JiaweiZhuang, @rsignell-usgs, @martindurant, @hdsingh, @bekozi, @fmaussion, @dopplershift

@rabernat
Member

Thank you @jthielen for taking the time to write up this very comprehensive and thoughtful issue. Also tagging @shoyer.

@martindurant
Contributor

martindurant commented Apr 22, 2020

While the CF convention undoubtedly is useful for earth-surface/atmosphere coordinates, I wonder to what extent the ideas here can be generalised to more coordinate types. For instance, I was once an astronomer, and although they would (probably) never use CDF/HDF, they might use zarr and xarray (or other). I hope and expect there are many other uses.

@dopplershift
Contributor

@martindurant My question on that would be: to what end? What are the use cases for generalizing something like (I assume) a coordinate system object? Both the input here (netCDF CF metadata conventions) and the output (likely cartopy and proj) are pretty earth-system specific. I'm not completely against it, but I'm not seeing what the common middle layer would provide.

@martindurant
Contributor

input here (netCDF CF metadata conventions) and the output (likely cartopy and proj)

If those are the constraints, then you are right - but xarray is broader than that, so I could see this as being just one of a set of possible coordinate mapping conventions.

@jthielen
Author

jthielen commented Apr 22, 2020

input here (netCDF CF metadata conventions) and the output (likely cartopy and proj)

If those are the constraints, then you are right - but xarray is broader than that, so I could see this as being just one of a set of possible coordinate mapping conventions.

The focus here (for "cf-xarray") was pretty strictly on CF metadata interpretation and application. I think additional discussion on coordinate mapping conventions for xarray in general would be great, but may be suited more for the discussion in #356?

@snowman2

Possibly helpful to be aware of these CF/pyproj issues:

@dopplershift
Contributor

dopplershift commented Apr 22, 2020

Right but what I see is:

CF -> Common Middle -> Proj
Astro Conventions -> Common Middle -> Astro Tool

Unless it makes sense to map astro conventions to proj or CF metadata to the astro tool (which I'm assuming it doesn't), then that "Common Middle" needs some kind of shared and useful functionality to justify its existence, no? That's what I'm missing.

@martindurant
Contributor

You may be right; I commented there too, but the focus was more along the lines of what "projections" (i.e., transformation) mean and how we can store them rather than on what typical names for various physical quantities like "height", "frequency", "spherical surface coords" might be.

@dcherian
Contributor

It seems like there are 2 kinds of "useful" attributes.

  1. Attributes to be decoded: e.g. bounds variables to some kind of IntervalIndex, units to pint thingies etc. basically an advanced xr.decode_cf(). In fact, we should think about how much of this would go right into xarray itself since xarray already decodes all CF attributes it can make use of in some way or the other.
  2. Attributes to be interpreted: e.g. _CoordinateAxisType to allow things like .cf.sum(dim="X") and .cf.plot(x="X", y="T") or .cf["X"].

1 is dependent on xarray making progress on things like duck array wrapping and custom indexes (i.e. will take time).

2 could be implemented right now as an extension of whatever metpy has AFAICT.
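To illustrate the "interpreted" case, here is a minimal sketch of the key rewriting that something like .cf.sum(dim="X") would need: translate CF axis labels into the actual dimension names, then hand the result to the ordinary xarray method. It works on plain attribute dictionaries and only consults the axis attribute; the function name resolve_keys and the pass-through behaviour for unrecognized keys are assumptions, not an existing API.

```python
from typing import Any, List, Mapping, Sequence


def resolve_keys(
    coord_attrs: Mapping[str, Mapping[str, Any]], keys: Sequence[str]
) -> List[str]:
    """Replace CF axis labels such as "X" with matching coordinate names.

    Keys that match no ``axis`` attribute are passed through unchanged, so
    ordinary dimension names keep working alongside CF labels.
    """
    resolved: List[str] = []
    for key in keys:
        matches = [
            name for name, attrs in coord_attrs.items()
            if attrs.get("axis") == key
        ]
        resolved.extend(matches if matches else [key])
    return resolved


# Hypothetical coordinates: name -> attribute dictionary.
example_coords = {
    "lon": {"axis": "X", "units": "degrees_east"},
    "lat": {"axis": "Y", "units": "degrees_north"},
    "time": {"axis": "T"},
    "lev": {},
}

dims = resolve_keys(example_coords, ["X", "lev"])
```

A wrapper would then call the underlying method with the rewritten names, along the lines of da.sum(dim=dims).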

@dopplershift
Contributor

From MetPy's perspective, we're happy to contribute what we have to a broader effort for interpreting CF metadata within xarray. I'm not sure how generally useful it is, but it certainly meets our requirements:

  • Identification of coordinates by type (e.g. time, longitude, vertical)
  • Selection by specifying only e.g. time, vertical rather than using a hard-coded dimension name
  • Unit-support in selection
  • Automatic generation of CartoPy projections from CF metadata

The overall goal is to facilitate changing of datasets without needing to adjust code--at least where that makes sense and the dataset contains the requisite metadata to make this a possibility.

Step one of doing CF stuff with xarray may be fixing the parts where xarray blows up (ok, errors out) on CF-compliant netCDF files (see pydata/xarray#2233 and pydata/xarray#2368). It's been on my todo list for quite a while to try to work on this problem, but I can't seem to stop accumulating things on top of this in my own stack.

@rabernat
Member

I think @dopplershift's proposal sounds perfect.

One very ironic thing is that a cf-python package already exists:
https://ncas-cms.github.io/cf-python/

The group at Reading has basically created an entirely new stack for climate data analysis which duplicates the functionality of both xarray and dask! I have tried unsuccessfully to convince them to collaborate more (https://bitbucket.org/cfpython/cf-python/issues/51/collaborate-more-closely-with-xarray-iris).

@kmpaul

kmpaul commented Apr 27, 2020

This is of great interest to the GeoCAT team here at NCAR, too. There's a lot of interest in seeing a solution here, and there are a lot of "small" projects out there that have attempted to solve part of the problem.

CC: @NCAR/xdev @NCAR/geocat

@kmpaul

kmpaul commented Apr 27, 2020

(P.S. Anyone know why the NCAR team @mentions above don't seem to work? Both teams are publicly visible, but they don't link...which means I don't think they are getting notifications.)

@djhoese
Contributor

djhoese commented Apr 27, 2020

(P.S. Anyone know why the NCAR team @mentions above don't seem to work? Both teams are publicly visible, but they don't link...which means I don't think they are getting notifications.)

I don't think you can link to teams defined outside the current org/repository.

@kmpaul

kmpaul commented Apr 27, 2020

@djhoese: That's disappointing. For some reason, I thought you could. Kinda makes it hard to loop in potential external collaborators, then.

@dcherian
Contributor

dcherian commented Jun 14, 2020

With the help of @kmpaul's and @jthielen's input and code from MetPy, cf-xarray is now alive and welcoming contributions!

https://github.com/xarray-contrib/cf-xarray
https://cf-xarray.readthedocs.io/en/latest/examples/introduction.html

@jthielen
Author

jthielen commented Jul 8, 2020

With https://github.com/xarray-contrib/cf-xarray being well on its way (already to v0.1.5), I think this issue can be safely closed and any cf-xarray discussion moved to its issue tracker. Thank you again @dcherian!

@jthielen jthielen closed this as completed Jul 8, 2020