Coordinate inheritance for xarray.DataTree #9077

Open
shoyer opened this issue Jun 7, 2024 · 19 comments
Labels
design question · topic-DataTree (Related to the implementation of a DataTree class)

Comments

@shoyer
Member

shoyer commented Jun 7, 2024

What is your issue?

Should coordinate variables be inherited between different levels of an Xarray DataTree?

The DataTree object is intended to represent hierarchical groups of data in Xarray, similar to the role of sub-directories in a filesystem or HDF5/netCDF4 groups. A key design question is if/how to enforce coordinate consistency between different levels of a DataTree hierarchy.

As a concrete example of how enforcing coordinate consistency could be useful, consider the following hypothetical DataTree, representing a mix of weather data and satellite images:
[image: hypothetical DataTree mixing weather data and satellite images, with coordinates placed at the highest applicable level]

Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:

  • time is a shared coordinate used by both weather and satellite variables
  • station is used only for weather variables
  • x and y are used only for satellite images

In this data model, coordinate variables are inherited by descendant nodes, which means that variables at different levels of a hierarchical DataTree are always aligned. Placing the time variable at the root node automatically indicates that it applies to all descendant nodes. Similarly, station is in the base weather_data node because it applies to all weather variables, both directly in weather_data and in the temperature sub-tree. Accessing any of the lower-level trees as an xarray.Dataset would automatically include coordinates from higher levels (e.g., time).
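The lookup described above can be sketched with Python's `collections.ChainMap`, under a deliberately simplified model (plain dicts of coordinate values; all names are invented for illustration, not xarray API): a node's effective coordinates are its own plus those of every ancestor.

```python
from collections import ChainMap

# Simplified model: each node stores only its own coordinate variables, and a
# node's *effective* coordinates are resolved by consulting the node first,
# then each ancestor up to the root.
root_coords = {"time": ["2020-12-01", "2020-12-02"]}
weather_coords = {"station": ["a", "b", "c", "d", "e", "f"]}
temperature_coords = {}  # the temperature node declares nothing of its own

# ChainMap searches its mappings left to right: node, parent, root.
effective = ChainMap(temperature_coords, weather_coords, root_coords)

assert effective["time"] == ["2020-12-01", "2020-12-02"]       # inherited from root
assert effective["station"] == ["a", "b", "c", "d", "e", "f"]  # from weather_data
assert "x" not in effective  # satellite-only coordinates are not visible here
```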

In an alternative data model, coordinate variables at every level of a DataTree are independent. This is the model currently implemented in the experimental DataTree project. To represent the same data, coordinate variables would need to be duplicated alongside data variables at every level of the hierarchy:
[image: the same DataTree with independent coordinates, where coordinate variables are duplicated at every level]

Which data model to prefer depends on which of two considerations we value more:

  1. Consistency: Automatically inherited coordinates will allow for DataTree objects with fewer redundant variables, which is easier to understand at a glance, similar to the role of the shared coordinate system on xarray.Dataset. You don’t need to separately check the time coordinates on the weather and satellite data to know that they are the same. Alignment, including matching coordinates and dimension sizes, is enforced by the data model.
  2. Flexibility: Enforcing consistency limits how you can organize data, because conflicting coordinates at different levels of a DataTree can no longer be represented in Xarray’s data model. In particular, some valid multi-group netCDF4 or Zarr files could not be loaded into a single DataTree object.

As a concrete example of what we lose in flexibility, consider the following two representations of a multiscale image pyramid, where each level of zoom has different x and y coordinates:

[image: two layouts of a multiscale image pyramid; zoom levels as sibling groups (valid) vs. nested under the base image (invalid with inherited coordinates)]

The version that places the base image at the root of the hierarchy would not be allowed in the inherited coordinates data model, because there would be conflicting x and y coordinates (or dimension sizes) between the root and child nodes. Instead, different levels of zoom would need to be placed under different groups (zoom_1x, zoom_2x, etc).
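The alignment rule described here can be sketched with a minimal dict-of-dimension-sizes model (the function and all names are invented for illustration, not xarray API): a node may reuse a dimension declared by an ancestor only if the sizes agree, which is exactly what rules out nesting zoom levels under the base image.

```python
# Hypothetical sketch of the alignment rule: a node may use a dimension
# already declared by an ancestor only if the sizes agree.
def check_alignment(inherited_dims: dict, node_dims: dict) -> None:
    for name, size in node_dims.items():
        if name in inherited_dims and inherited_dims[name] != size:
            raise ValueError(
                f"dimension {name!r}: size {size} conflicts with "
                f"inherited size {inherited_dims[name]}"
            )

# Base image at the root (1000 x 1000) is fine on its own:
check_alignment({}, {"x": 1000, "y": 1000})

# A zoom_2x child at 500 x 500 under that root conflicts:
try:
    check_alignment({"x": 1000, "y": 1000}, {"x": 500, "y": 500})
except ValueError as err:
    print(err)  # dimension 'x': size 500 conflicts with inherited size 1000
```

Placing the zoom levels in sibling groups avoids the conflict entirely, because siblings do not inherit from each other.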

As we consider making this change to the (as yet unreleased) DataTree object in Xarray, I have two questions for prospective DataTree users:

  1. Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?
  2. Do you have existing uses for DataTree objects or multi-group netCDF/Zarr files that would be positively or negatively impacted by this change?

xref: #9063, #9056

CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale

@shoyer shoyer added the needs triage Issue that has not been reviewed by xarray team member label Jun 7, 2024
@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jun 7, 2024
@TomNicholas
Contributor

TomNicholas commented Jun 7, 2024

This is an incredible write-up of the question @shoyer ! This will be very useful for getting people's feedback on.

Nit: The only part I think could be clearer is that, IIUC, the reason the second image pyramid is invalid in the inherited coordinates model is that the (x, y) dimensions on the zoom_x2/zoom_x4 nodes have different sizes than they do in the root node. It's not invalid just because there are coordinates named x/y at two levels of the tree. (If they were the same size, i.e. no difference in zoom level, then it would be valid: the child node's x coordinate would take precedence over the parent's x coordinate.) Your write-up does already indicate this, but the diagram might be a little clearer if it listed the dimension sizes, not just the dimension names?

@dcherian dcherian removed the needs triage Issue that has not been reviewed by xarray team member label Jun 7, 2024
@etienneschalk
Contributor

Hello,

First, thanks for the wonderful diagrams, they convey the message so well and help to reason about the topic!

I have been working with the Zarr format and satellite imagery recently, as well as with the experimental datatree Python package, and I can provide some feedback.


I will start with an example as maybe it will explain my point of view better. I work with a data structure looking like this:

- (root)
    - image
        - resolution_10m (coords (y, x) at 10m)
            - band_a     (variable, dims = (y, x))
            - band_b     (variable, dims = (y, x))
            - ...
        - resolution_20m (coords (y, x) at 20m)
        - resolution_60m (coords (y, x) at 60m)
    - mask
        - resolution_10m (coords (y, x) at 10m) (duplicated)
        - resolution_20m (coords (y, x) at 20m) (duplicated)
        - resolution_60m (coords (y, x) at 60m) (duplicated)

Each resolution_* group is a valid Dataset. Band variables could in theory be stacked into a single 3-D array, but they are kept as 2-D rasters (y, x). The main benefit of using groups is avoiding suffixed dimension names (e.g., no y_60m, y_20m, etc.), and having cleaner datasets with less "suffixed coordinates noise".


About Consistency vs Flexibility:

Until now I favored Flexibility

In an alternative data model, coordinate variables at every level of a DataTree are independent.

Datasets already allow sharing coordinates between variables. Until now, I have not had a need for coordinate sharing at the tree level. I used DataTree as a way to organize collections of individually valid Datasets hierarchically, and for its nice interface for working with Zarr stores.

While data duplication seems bad at first glance, it also has benefits: self-sufficient Zarr leaf groups that can be loaded as Datasets without needing DataTree. I try (when possible) to use the simplest data structure available to represent my data; that means favoring Datasets over DataTrees, and DataArrays over Datasets. So I have two options: open the full Zarr store as a DataTree, or open only the leaf group that I am interested in as a Dataset.

Feedback

  1. Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?

In particular, some valid multi-group netCDF4 files/Zarr could not be loaded into a single DataTree object.

One of the selling points of xarray is being a universal data reader: it can open many datasets without too many preconditions on those datasets. However, some limitations have existed, like the inability of xarray to read or write multiple NetCDF groups at once (#1092). I understand that datatree aims to solve this family of problems, so I think it would be nice if xarray tried its best to open as many datasets as possible.

Could we have both Consistency and Flexibility, with a scope mechanism? For instance, the right case of the third schema is invalid because no override is possible: the names x and y are shared. Adding overriding to inheritance could solve the problem. In many programming languages, variables are block-scoped, so one can reuse the same variable name, with the new variable shadowing the old one within its block. The "Invalid with inherited coordinates" schema could be valid if a scope mechanism were implemented, favouring the closest coordinate found and escalating up the hierarchy if none is found. I am aware that this is easier said than done, as shown in #9056; I lack an in-depth understanding of datatree, having used only a fraction of its features (mainly reading Zarr stores).
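The scope mechanism proposed above could be sketched as follows (the Node class and its method are invented for illustration, not datatree API): lookup favours the closest node and escalates towards the root when the name is not found locally.

```python
# Hypothetical sketch of block-style scoping for coordinates: resolve a name
# at the closest node, walking towards the root if it is not defined locally.
class Node:
    def __init__(self, coords, parent=None):
        self.coords = coords    # this node's own coordinate variables
        self.parent = parent

    def resolve(self, name):
        node = self
        while node is not None:  # escalate up the hierarchy
            if name in node.coords:
                return node.coords[name]
            node = node.parent
        raise KeyError(name)

root = Node({"x": [0, 10, 20, 30]})
zoom = Node({"x": [0, 5, 10, 15]}, parent=root)  # shadows the root's x
leaf = Node({}, parent=zoom)                     # declares nothing itself

assert leaf.resolve("x") == [0, 5, 10, 15]   # closest definition wins
assert root.resolve("x") == [0, 10, 20, 30]  # the root is unaffected
```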

In a flat Dataset, there is a single shared namespace for dimension names, so a dimension name can only exist once. (This is a limitation for some applications: with square matrices you cannot have, for instance, (b, b) dims; you need to distinguish them, e.g. (b_k, b_l).)

  2. Do you have existing uses for DataTree objects or multi-group netCDF/Zarr files that would be positively or negatively impacted by this change?

I don't think my Zarr stores would be negatively impacted, because all my coordinates are on leaves. I can safely reuse my x and y dimensions, with different sizes, as they are enclosed in different leaf resolution groups. My data looks like the left case of the third schema, which as a user feels the most secure and robust to any future changes in datatree. Using coordinate inheritance would remove a little bit of 1-D coordinate duplication, but it would impose new constraints on my data structure: for instance, the resolution groups would have to sit higher in the tree hierarchy to allow the reuse of x and y coordinates downstream. For now, I have the liberty to place resolution groups at the bottom of the hierarchy, which makes the high-level structure more readable and avoids duplicating it inside potential top-level resolution groups.

Example:

Current:

- image
    - resolution_10m (coords x, y at 10m)
    - resolution_20m (coords x, y at 20m)
    - resolution_60m (coords x, y at 60m)
- mask
    - resolution_10m (coords x, y at 10m) (duplicated)
    - resolution_20m (coords x, y at 20m) (duplicated)
    - resolution_60m (coords x, y at 60m) (duplicated)

What it would resemble by using coordinate inheritance:

- resolution_10m (coords x, y at 10m)
    - image
    - mask
- resolution_20m (coords x, y at 20m)
    - image (duplicated)
    - mask (duplicated)
- resolution_60m (coords x, y at 60m)
    - image (duplicated)
    - mask (duplicated)

-> The high-level structure (image, mask) is duplicated for each resolution group. I prefer the flexibility of the first structure at the cost of consistency.

About Zarr groups

Accessing a Zarr group inside a Zarr directory without knowing about the other groups is a use case where a Zarr store is used more like a database from which we only extract the necessary information, while staying agnostic of the full structure. As with a database query, we don't necessarily need to know the complete schema if we only need a few tables.

Also, if you load a Zarr group, it seemed to me that it could not fetch variables higher up in the hierarchy. If you load group='b/c' but the group c needs a coordinate at the b level, we would need to go outside the scope of the group to retrieve that information, and the group would no longer be self-sufficient. This seems to be the topic of #9056 (DataTree access to variables in parent groups). However, I don't know if Zarr groups are meant to be self-sufficient in the first place, so maybe this is not a problem. I just sense a potential technical problem when accessing Zarr stores on bucket storage, as the Zarr group path has a "physical" meaning on the filesystem.

To reuse the schemas of the original post, maybe a user needs to access a Zarr database but only needs the weather_data, not the satellite_image. The weather_data and satellite_image can be cut off the main branch and be self-sufficient trees on their own in the second schema (Flexibility), but not in the first (Consistency). However, there is then indeed no guarantee that the time coordinates match; the "central authority" (root DataTree, first schema) could ensure that. Note that this consistency could also be enforced at writing time rather than at reading time.

About the memory footprint of coordinates duplication

In my own experience, coordinates are usually lighter than data. For instance, for a 2-dimensional square raster image of side N, the data occupies N**2 values in memory, while the associated coordinates occupy 2*N. As a result, the ratio of coordinates to data is 2*N / N**2 = 2/N. As the data grows in size, the coordinates contribute less and less to the memory footprint. However, this statement is only valid for 1-D coordinates, not Multidimensional Coordinates. On my side I have rarely used Multidimensional Coordinates, but they may be a frequently used feature. While the memory increase due to duplication is insignificant for 1-D coordinates, it becomes a problem if coordinates have the same shape as the data itself, e.g. 2-D coordinates for 2-D rasters.
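The ratio above can be checked numerically (8-byte float64 values assumed throughout; the numbers are illustrative, not from the thread):

```python
# Memory footprint of duplicated coordinates for a square N x N raster.
N = 10_000
data_bytes = N * N * 8       # one N x N float64 raster: N**2 values
coords_1d_bytes = 2 * N * 8  # 1-D x and y coordinates: 2*N values

# Ratio is 2/N, i.e. 0.02% for N = 10,000: duplication is cheap.
assert coords_1d_bytes / data_bytes == 2 / N

# With 2-D (multidimensional) coordinates, each coordinate array is as
# large as the data itself, so duplication doubles the footprint per copy.
coords_2d_bytes = 2 * N * N * 8
assert coords_2d_bytes / data_bytes == 2.0
```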

@etienneschalk
Contributor

etienneschalk commented Jun 8, 2024

I tried to create an actual DataTree out of the example shown on the first schema, if it can help:

DataTree creation code
import pandas as pd
import numpy as np
import xarray as xr
from xarray.core import datatree as dt

xdt = dt.DataTree.from_dict(
    name="(root)",
    d={
        "/": xr.Dataset(
            coords={
                "time": xr.DataArray(
                    data=pd.date_range(start="2020-12-01", end="2020-12-02", freq="D")[
                        :2
                    ],
                    dims="time",
                    attrs={
                        "units": "date",
                        "long_name": "Time of acquisition",
                    },
                )
            },
            attrs={
                "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
            },
        ),
        "/weather_data": xr.Dataset(
            coords={
                "station": xr.DataArray(
                    data=list("abcdef"),
                    dims="station",
                    attrs={
                        "units": "dl",
                        "long_name": "Station of acquisition",
                    },
                )
            },
            data_vars={
                "wind_speed": xr.DataArray(
                    np.ones((2, 6)) * 2,
                    dims=("time", "station"),
                    attrs={
                        "units": "meter/sec",
                        "long_name": "Wind speed",
                    },
                ),
                "pressure": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "hectopascals",
                        "long_name": "Time of acquisition",
                    },
                ),
            },
            attrs={"description": "Weather data node, inheriting the 'time' dimension"},
        ),
        "/weather_data/temperature": xr.Dataset(
            data_vars={
                "air_temperature": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Air temperature",
                    },
                ),
                "dewpoint_temp": xr.DataArray(
                    np.ones((2, 6)) * 4,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Dew point temperature",
                    },
                ),
            },
            attrs={
                "description": (
                    "Temperature, subnode of the weather data node, "
                    "inheriting the 'time' dimension from root and 'station' "
                    "dimension from the Temperature group."
                )
            },
        ),
        "/satellite_image": xr.Dataset(
            coords={"x": [10, 20, 30], "y": [90, 80, 70]},
            data_vars={
                "infrared": xr.DataArray(
                    np.ones((2, 3, 3)) * 5, dims=("time", "y", "x")
                ),
                "true_color": xr.DataArray(
                    np.ones((2, 3, 3)) * 6, dims=("time", "y", "x")
                ),
            },
        ),
    },
)
display(xdt)
print(xdt)
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

@shoyer
Member Author

shoyer commented Jun 9, 2024

Your write-up does already indicate this, but the diagram might be a little clearer if it listed the dimension sizes not just the dimension names?

Good suggestion, done!

Could we have both Consistency and Flexibility, with a scope mechanism?

This is definitely an alternative worth considering! In my opinion, inheritance of coordinates with optional overrides would definitely be preferable to a lack of any inheritance at all, because it reduces the need for redundant coordinates. We can probably figure out a reasonable way to adjust the DataTree repr to make it (somewhat) clear when a coordinate is overridden.

That said, my preference would be overrides with strictly enforced alignment. The problem with purely optional overrides is that you have no way of guaranteeing alignment between hierarchy levels when building a dataset: you could set time at the root, and then accidentally add weather data with the wrong time axis without any check.

@etienneschalk
Contributor

So, if I understand correctly, the first time a dimension appears in the hierarchy, it imposes its size on all nodes lower in the hierarchy. Overriding would only be possible if the coordinate variable downstream still has the same size as its parent's. It means we can still benefit from coordinate inheritance while also benefiting from namespacing for groups, on the condition that sibling coordinate variables are declared at the same level in the tree hierarchy (third schema, left case: any potential subgroups of the "parallel" zoom groups would inherit the x and y coordinates of their zoom group). Then, what would be a concrete use case for overriding a coordinate variable with the same size but different values? I realize that I don't really see any need for coordinate overriding in my use case, actually.

The only issue would be that xarray could not open all valid Zarr files (but still far more than currently with Datasets alone!), which can be problematic for people who read Zarr stores that are potentially not xarray-compatible. But an application producing Zarrs using xarray will always produce valid xarray-readable Zarrs (that is what matters most to me), and it will not be a problem that xarray-compatible Zarrs are just a subset of all valid Zarrs.

What I mean is that someone using xarray as a tool to produce data will probably think primarily in terms of the in-memory data structures of xarray, rather than the actual file storage formats and their shenanigans, especially if multiple of these formats must be produced and xarray is the lowest common denominator of them all.

@shoyer
Member Author

shoyer commented Jun 11, 2024

So, if I understand correctly, the first time a dimension appears in the hierarchy, it imposes its size on all nodes lower in the hierarchy. Overriding would only be possible if the coordinate variable downstream still has the same size as its parent's.

In principle, we could also allow overriding dimension sizes lower in a hierarchy.

I agree that I do not see a concrete use case for optional overrides, beyond allowing reading arbitrary hierarchical netCDF/Zarr files in Xarray. However, this would sacrifice a great deal of guaranteed consistency, and in my opinion would not be a worthwhile tradeoff.

@TomNicholas
Contributor

TomNicholas commented Jun 11, 2024

[when opening a single group of a zarr store from object storage] we need to go outside of the scope of the group to retrieve such information

@etienneschalk it's a good question, but I think that calling open_datatree('store.zarr', group='not/the/root') could simply ignore anything above the specific group you asked it to open. (open_datatree is really better thought of as open_subtree in this case.)


Some other good points came up in today's meeting from @eni-awowale :

  • It would still be possible to open individual groups of non-compliant netCDF files into xarray via the group kwarg to open_dataset, and subtrees of non-compliant netCDF files via the group kwarg to open_datatree (see #9014, "open_datatree performance improvement on NetCDF, H5, and Zarr files").
  • Without the ability to open any netCDF file, we actually can't even list all the groups in a file, so we can't use open_datatree(path).groups() to close #4840 ("Opening a dataset doesn't display groups") in the general case.
  • It would be nice to provide another way to list the groups in any netCDF file, even if that file is not compliant.
  • We could imagine a separate list_groups function, but another idea would be to add another function to the API something like open_dataset_as_dict(path) (or some other better name), whose return type would be a dict[str, xr.Dataset].
  • This would mean DataTree.from_dict(open_dataset_as_dict(path)) would be equivalent to open_datatree(path) for compliant files, but if your file was not compliant you could still list the groups via list(open_datatree_as_dict(path).keys()).
  • Another advantage of this would be if you wanted to open up an almost-compliant file, do some manipulation (e.g. dropping offending coordinates at the root), then convert it to a true DataTree object, that becomes pretty easy.
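The dict-of-datasets workflow sketched in these bullets can be illustrated with plain Python (strings stand in for the xr.Dataset values, and open_dataset_as_dict itself is a proposed, hypothetical function, not an existing API):

```python
# A flat dict keyed by group path, as the proposed open_dataset_as_dict
# might return. Plain strings stand in for xr.Dataset values here.
flat = {
    "/": "root-dataset",
    "/weather_data": "weather-dataset",
    "/weather_data/temperature": "temperature-dataset",
    "/satellite_image": "satellite-dataset",
}

# Listing groups works regardless of coordinate compliance:
groups = sorted(flat)
assert groups == ["/", "/satellite_image", "/weather_data",
                  "/weather_data/temperature"]

# Manipulating an almost-compliant file (e.g. dropping an offending group)
# before building a tree is an ordinary dict operation:
fixed = {path: ds for path, ds in flat.items() if path != "/satellite_image"}
assert "/satellite_image" not in fixed
```

For compliant files, the idea is that DataTree.from_dict applied to such a dict would give the same result as open_datatree.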

@flamingbear
Contributor

@TomNicholas do you mean open_datatree_as_dict(path) or am I missing something?

@hmaarrfk
Contributor

I've been very excited about datatree's "suggested feature" to break away from enforcing consistent coordinates between data arrays.

Independent coordinates would allow us to have multiple different "x" and "y" coordinates that can come together in a single structure.

Without the flexibility of independent coordinates, we would likely have to start looking elsewhere as our application complexity grows.

Generally speaking, we have:

  1. One main dataset generated by our instrument.
  2. Applications then use that data and extract insights from it.
  3. They would then like to store the insights "alongside" the data, without necessarily clobbering or conflicting with anything. For example, the original data may be archived at a higher or lower resolution.

For now, our "solution" is to "prefix" all our "new data" with something like application_name so that x and y for the application are application_name_x and application_name_y. For us, they share the same units, but they may be on a different "grid".

The flexibility of having different trees is the main advantage I see in creating a hierarchy.

@danielfromearth

This is such an interesting and important question. Thanks @shoyer for opening the question up for discussion in such a clear manner!

Not sure I have too much to add, but in my opinion, the inheritance of coordinate variables is a very intuitive concept, and one that is assumed by a lot of data producers. Lots of existing data sets (at least for NASA Earth Science data, my focus) are structured assuming inherited coordinate variables. Nevertheless, I think having the ability to override this to allow for more flexibility if needed may be a good way to approach it.

It would be nice to provide another way to list the groups in any netCDF file, even if that file is not compliant.

I second this point!

@TomNicholas
Contributor

TomNicholas commented Jun 13, 2024

@TomNicholas do you mean open_datatree_as_dict(path) or am I missing something?

@flamingbear I don't know what to call it, but the most literal name would be open_as_dict_of_datasets. As I said, the important thing here is that the return type of the function would be dict[str, xr.Dataset], i.e. no DataTree objects, so this suggestion would be flexible enough to open any netCDF file.

@eni-awowale
Contributor

eni-awowale commented Jun 13, 2024

This has been a great conversation and it's been interesting hearing everyone's insights. I think this inheritance model truly captures the spirit of CF conventions. I do worry that this model might be too restrictive for some ultra Quirky Data, unless we incorporate some additional functionality like open_datatree_as_dict. As a user, I like that I can just use open_datatree() to open any netCDF4 or he5 granule regardless of whether or not the data is CF compliant, because often it's not.

Here is an example dataset that I think would be invalid under this model. The dimensions in the root group are sounding_dim and vertex_dim. In the group 'Offset' the dimensions are signalbin_dim, footprint_dim, and statistics_dim. The group 'Sequences' has a dimension sequences_dim plus the dimension sounding_dim from the root group. Lastly, the group 'Cloud' has only the dimension sounding_dim from the root group; I am unclear whether, in this model, it would also inherit vertex_dim from the root group.

DataTree('None', parent=None)
│   Dimensions:                (sounding_dim: 188677, vertex_dim: 4)
│   Dimensions without coordinates: sounding_dim, vertex_dim
│   Data variables: (12/15)
│       Delta_Time             (sounding_dim) float64 2MB ...
│       SZA                    (sounding_dim) float32 755kB ...
│       VZA                    (sounding_dim) float32 755kB ...
│       SAz                    (sounding_dim) float32 755kB ...
│       VAz                    (sounding_dim) float32 755kB ...
│       Longitude              (sounding_dim) float32 755kB ...
│       ...                     ...
│       SIF_740nm              (sounding_dim) float32 755kB ...
│       SIF_Uncertainty_740nm  (sounding_dim) float32 755kB ...
│       Daily_SIF_740nm        (sounding_dim) float32 755kB ...
│       Daily_SIF_757nm        (sounding_dim) float32 755kB ...
│       Daily_SIF_771nm        (sounding_dim) float32 755kB ...
│       Quality_Flag           (sounding_dim) float64 2MB ...
│   Attributes: (12/32)
│       References:                        ['Sun, Y. et al., Remote Sensing of En...
│       conventions:                       CF-1.6
│       product_version:                   B11012Ar
│       summary:                           Fraunhofer-line based SIF retrievals
│       keywords:                          ISS, OCO-2, Solar Induced Fluorescence...
│       keywords_vocabulary:               NASA Global Change Master Directory (G...
│       ...                                ...
│       InputBuildId:                      B11.0.06
│       InputPointers:                     oco2_L2MetGL_39883a_211231_B11006r_220...
│       CoordSysBuilder:                   ucar.nc2.dataset.conv.CF1Convention
│       identifier_product_doi_authority:  http:https://dx.doi.org/
│       gesdisc_collection:                11r
│       identifier_product_doi:            10.5067/OTRE7KQS8AU8
├── DataTree('Cloud')
│       Dimensions:             (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           surface_albedo_abp  (sounding_dim) float32 755kB ...
│           cloud_flag_abp      (sounding_dim) float64 2MB ...
│           delta_pressure_abp  (sounding_dim) float32 755kB ...
│           co2_ratio           (sounding_dim) float32 755kB ...
│           o2_ratio            (sounding_dim) float32 755kB ...
├── DataTree('Geolocation')
│       Dimensions:                       (sounding_dim: 188677, vertex_dim: 4)
│       Dimensions without coordinates: sounding_dim, vertex_dim
│       Data variables:
│           time_tai93                    (sounding_dim) datetime64[ns] 2MB ...
│           solar_zenith_angle            (sounding_dim) float32 755kB ...
│           solar_azimuth_angle           (sounding_dim) float32 755kB ...
│           sensor_zenith_angle           (sounding_dim) float32 755kB ...
│           sensor_azimuth_angle          (sounding_dim) float32 755kB ...
│           altitude                      (sounding_dim) float32 755kB ...
│           longitude                     (sounding_dim) float32 755kB ...
│           latitude                      (sounding_dim) float32 755kB ...
│           footprint_longitude_vertices  (sounding_dim, vertex_dim) float32 3MB ...
│           footprint_latitude_vertices   (sounding_dim, vertex_dim) float32 3MB ...
├── DataTree('Metadata')
│       Dimensions:          (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           CollectionLabel  <U17 68B ...
│           BuildId          <U8 32B ...
│           OrbitId          (sounding_dim) float64 2MB ...
│           SoundingId       (sounding_dim) float64 2MB ...
│           FootprintId      (sounding_dim) float64 2MB ...
│           MeasurementMode  (sounding_dim) float64 2MB ...
├── DataTree('Meteo')
│       Dimensions:                 (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           surface_pressure        (sounding_dim) float32 755kB ...
│           specific_humidity       (sounding_dim) float32 755kB ...
│           vapor_pressure_deficit  (sounding_dim) float32 755kB ...
│           temperature_skin        (sounding_dim) float32 755kB ...
│           temperature_two_meter   (sounding_dim) float32 755kB ...
│           wind_speed              (sounding_dim) float32 755kB ...
├── DataTree('Offset')
│       Dimensions:                    (signalbin_dim: 227, footprint_dim: 8,
│                                       statistics_dim: 2)
│       Dimensions without coordinates: signalbin_dim, footprint_dim, statistics_dim
│       Data variables: (12/13)
│           signal_histogram_bins      (signalbin_dim) float32 908B ...
│           signal_histogram_757nm     (signalbin_dim, footprint_dim) float64 15kB ...
│           signal_histogram_771nm     (signalbin_dim, footprint_dim) float64 15kB ...
│           SIF_Relative_Mean_757nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Mean_757nm             (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Median_757nm  (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           ...                         ...
│           SIF_Relative_SDev_757nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Mean_771nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Mean_771nm             (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Median_771nm  (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Median_771nm           (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_SDev_771nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
├── DataTree('Science')
│       Dimensions:                        (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables: (12/16)
│           sounding_qual_flag             (sounding_dim) float64 2MB ...
│           IGBP_index                     (sounding_dim) float64 2MB ...
│           continuum_radiance_757nm       (sounding_dim) float32 755kB ...
│           SIF_757nm                      (sounding_dim) float32 755kB ...
│           SIF_Unadjusted_757nm           (sounding_dim) float32 755kB ...
│           SIF_Relative_757nm             (sounding_dim) float32 755kB ...
│           ...                             ...
│           SIF_Unadjusted_771nm           (sounding_dim) float32 755kB ...
│           SIF_Relative_771nm             (sounding_dim) float32 755kB ...
│           SIF_Unadjusted_Relative_771nm  (sounding_dim) float32 755kB ...
│           SIF_Uncertainty_771nm          (sounding_dim) float32 755kB ...
│           daily_correction_factor        (sounding_dim) float32 755kB ...
│           sounding_land_fraction         (sounding_dim) float32 755kB ...
└── DataTree('Sequences')
        Dimensions:         (sequences_dim: 0, sounding_dim: 188677)
        Dimensions without coordinates: sequences_dim, sounding_dim
        Data variables:
            SequencesName   (sequences_dim) <U1 0B ...
            SequencesId     (sequences_dim) <U1 0B ...
            SequencesMode   (sequences_dim) <U1 0B ...
            SequencesIndex  (sounding_dim) float64 2MB ...
            SegmentsIndex   (sounding_dim) float64 2MB ...

I like that open_datatree opens up HDF files without enforcing any sort of CF compliance. We've been testing DataTree on dozens of quirky collections and it's worked on all of them so far. This is getting kind of philosophical, I suppose, but I don't think we should restrict the usability of our tool in any way, and this feels like it does.

I also want to add that I don't think we should necessarily make our decision based on one quirky collection at our DAAC. Though, I think this kind of quirkiness isn't unique to just GES DISC and NASA. Here is the data download link for anyone who is interested in looking at this granule.

Edit:
Since the dimensions in the groups I mentioned above have different names than the root group this inheritance model would be valid for this dataset. Thanks for the clarification @shoyer!

@flamingbear
Contributor

literal name would be open_as_dict_of_datasets.

@TomNicholas Gotcha. I'm back on board, and I think this is an important feature. I've found one dataset that would already fail to validate, but I think the ability to open and see all the groups will be useful, as will the ability to tweak the coords to be compliant before creating the tree.

@itcarroll
Contributor

itcarroll commented Jun 13, 2024

Your Question

Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?

TL;DR I'm for inheritance with overrides that can include differences in dimension size.

Consistency that breaks the ability to read valid netCDF4 files does not strike me as a good idea, and not just (I hope!) because I'm also affiliated with a NASA DAAC. @eni-awowale said it quite clearly: "As a user, I like that I can just use open_datatree() to open any netCDF4 or he5 granule regardless of whether or not the data is CF compliant because often times it's not." The following example does not even violate CF Conventions, which I think is mum on inheritance (although the NUG is not).

If I understand the consistency data model correctly, the following would generate a "test.nc" that DataTree would not open:

import netCDF4

with netCDF4.Dataset("test.nc", "w") as test:
    test.createDimension("x", 3)            # root-level dimension "x" of size 3
    a = test.createGroup("a")
    a.createDimension("x", 4)               # group "a" redefines "x" with size 4
    x = a.createVariable("x", "b", ("x",))  # resolves to the local "x" (size 4)

For those not used to working with netCDF4 directly, note that there is no coordinate object. You have to mentally model coordinates as variables with the same name as their only dimension.

I would want XArray to read this file into a DataTree, and, like netCDF, just leave the root level x dimension as a dead end. It's not referenced. Could @shoyer elaborate on concerns regarding different coordinates with the same name found in different groups? Because I'm not seeing a downside. I believe netCDF4 does a search on name up the tree, starting from the current group; isn't that sufficient for assigning coordinates to data variables?
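The upward search described here can be sketched in plain Python. This is only an illustration of netCDF4-style name resolution, not the netCDF4 library's actual API; groups are modeled as plain dicts with a `parent` link:

```python
# Sketch of netCDF4-style dimension lookup: check the current group's
# dimensions, then walk up through ancestor groups until the name is found.
def find_dimension(group: dict, name: str) -> int:
    node = group
    while node is not None:
        if name in node["dimensions"]:
            return node["dimensions"][name]
        node = node.get("parent")
    raise KeyError(f"dimension {name!r} not found in group or its ancestors")

root = {"dimensions": {"x": 3}, "parent": None}
a = {"dimensions": {"x": 4}, "parent": root}   # local "x" shadows the root's
b = {"dimensions": {}, "parent": root}         # falls back to the root's "x"

print(find_dimension(a, "x"))  # 4
print(find_dimension(b, "x"))  # 3
```

Under this lookup rule, a variable in a child group can reference a root-level dimension without any conflict, which is exactly what the "test.nc" example above relies on.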

My Question

What would DataTree do with this "quirky" dataset when allowing inheritance:

import netCDF4

with netCDF4.Dataset("test.nc", "w") as test:
    test.createDimension("x", 3)            # dimension "x" lives at the root
    a = test.createGroup("a")
    x = a.createVariable("x", "b", ("x",))  # coordinate data for "x" lives in group "a"
    b = test.createGroup("b")
    y = b.createVariable("y", "b", ("x",))  # "y" lives in group "b", using the root's "x"

Here the placement of the x dimension, but not the x variable with coordinate data, suggests y should get x as a coordinate. This is a lot like the datasets our DAAC distributes, and if DataTree would handle it by giving y the x coordinate ... well, that would bring me hope for making our data more XArray friendly. Edit: Alternatively, a DataTree.set_coords method for cross-group setting would be nice, and eliminate complications having to do with a potential third x in yet another group.

Thank you for posting the question for discussion!

@autydp

autydp commented Jun 19, 2024

Hi all. I work closely with Matt, Owen and Eni, but I thought I would add my perspective here. Apologies if I am not up on all of the nuances, but I definitely recommend flexibility over enforcement here, or at least, if there is a way to offer a flexible option alongside enforcement, that should be considered.

Unfortunately, as much as we want our data designers to adhere to best practices, we rarely get a chance to influence their decisions. There may be good reasons for their choices, but this kind of data integrity rarely gets considered. I've just responded to another new mission data design this morning (NISAR) which is choosing to put all of their dimension variables in a "sidecar" group off of root - as in their case, there are a number of dimensions being considered and it probably seemed too cluttered at the root level.

We call these datasets "quirky data" - and while it is not always the case, there are just too many cases not to try to make things a bit easier for rectifying the data. These issues always take up more time and effort than is imagined or planned for.

@TomNicholas
Contributor

Thanks everyone for your input here!

Here's what I'm hearing:

the inheritance of coordinate variables is a very intuitive concept

Inherited coordinates would allow us to have multiple different "x" and "y" coordinates that can come together in a single structure.

consistency is definitely the better way. (from twitter)

But also:

One of the selling points of xarray is as an universal data-reader.

Consistency that breaks the ability to read valid netCDF4 files does not strike me as a good idea

"quirky data" - ... there are just too many cases not to try to make things a bit easier for rectifying the data.

So tl;dr People would like inherited coordinates, but only if they can still easily read "quirky" datasets into xarray.

From our point of view as developers, we strongly favour having a single global behaviour which provides strict contracts that we can rely on and reason about. This allows us to write clearer abstractions, develop quicker, have clearer documentation, and less chance of confusing behaviour. xarray.Dataset has been successful in this respect, enforcing alignment between variables to allow the set of all dimension sizes of a single Dataset to be represented as dict[str, int].
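The `dict[str, int]` invariant can be sketched as a simple merge-and-check in plain Python. This is an illustration of the alignment rule, not xarray's actual implementation:

```python
# Sketch of the strict-alignment invariant: every variable's dimension
# sizes must fit into a single dict[str, int], or merging fails.
def merged_sizes(variables: dict[str, dict[str, int]]) -> dict[str, int]:
    sizes: dict[str, int] = {}
    for var_name, dims in variables.items():
        for dim, size in dims.items():
            if sizes.setdefault(dim, size) != size:
                raise ValueError(
                    f"conflicting sizes for dimension {dim!r}: "
                    f"{sizes[dim]} vs {size} (from variable {var_name!r})"
                )
    return sizes

print(merged_sizes({"a": {"x": 3}, "b": {"x": 3, "y": 2}}))  # {'x': 3, 'y': 2}
```

Extending this invariant from a single Dataset to a whole DataTree is what makes strict coordinate inheritance attractive from an implementation standpoint: any node's effective dimensions remain representable as one consistent mapping.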


Given this, we are strongly leaning towards coordinate inheritance with strict alignment checks at DataTree construction time, while also adding an open_dict_of_datasets function to ensure that any valid netCDF file can still be opened by xarray, and to facilitate rectifying such data before analysis. See #9137.

@owenlittlejohns
Contributor

owenlittlejohns commented Jun 20, 2024

I think this all sounds good. A couple of thoughts I had looking at this file...

Some 1-D coordinate variables have associated "bounds" variables per the CF-Conventions. This led me to two questions:

  1. Should we be propagating/inheriting those bounds variables if they are present?
  2. These bounds variables are referred to within the metadata attributes of the 1-D coordinate variables. There may also be other references to variables in the metadata attributes of 1-D coordinate variables. Do we need to update those metadata attribute references in the objects passed down to child nodes?

(Note - the example file above doesn't have the issues with the questions I'm mentioning, as all the variables are in a single group, and so there are no child nodes to pass things down to. I just wanted to give an example of what bounds might look like)

@TomNicholas
Contributor

Some 1-D coordinate variables have associated "bounds" variables per the CF-Conventions. This led me to two questions:

Xarray has a policy of generally not special-casing behaviour to follow any specific metadata conventions. In particular:

For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)

So with this in mind...

Should we be propagating/inheriting those bounds variables if they are present?

They shouldn't be special-cased. So if they are coordinate variables with alignable dimensions then they will be propagated. I believe that will give sensible behaviour anyway - if the "bounds" variables are defined in the same group as the 1D coordinate variable they refer to then they should be accessible from anywhere that the 1D coordinates are.

These bounds variables are referred to within the metadata attributes of the 1-D coordinate variables. There may also be other references to variables in the metadata attributes of 1-D coordinate variables. Do we need to update those metadata attribute references in the objects passed down to child nodes?

Xarray explicitly doesn't do any detailed changes to metadata (see the same FAQ question linked above). IMO changing the metadata like this would fall outside the scope of the data model, as it would require xarray to understand that certain metadata fields are intended to be dynamic and group-dependent. The completely general case of what to do about metadata propagation is tracked in #1614.

@shoyer
Member Author

shoyer commented Jun 20, 2024

  • Should we be propagating/inheriting those bounds variables if they are present?

In my current prototype implementation, as long as the bounds variables are stored as "coordinates" they will be automatically propagated to child nodes, unless there is a conflicting coordinate with the same name on the child. This (coordinates that are not indexes) is the only case in which inheritance is not strict.
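The rule described above could be sketched roughly as follows. This is an illustration of the stated behaviour for non-index coordinates, assuming coordinates are plain name-to-value mappings; it is not the prototype's actual code:

```python
# Sketch of the propagation rule for non-index coordinates: a child node
# sees its parent's coordinates, except where it defines a coordinate of
# the same name, in which case the child's version takes precedence.
def inherited_coords(parent_coords: dict, child_coords: dict) -> dict:
    coords = dict(parent_coords)
    coords.update(child_coords)  # a conflicting child name wins
    return coords

parent = {"time": [0, 1, 2], "time_bounds": [[0, 1], [1, 2], [2, 3]]}
child = {"time_bounds": [[0.5, 1.5]]}  # conflicts with the parent's, so it overrides

print(inherited_coords(parent, child)["time"])         # inherited from parent
print(inherited_coords(parent, child)["time_bounds"])  # child's override
```

For indexed coordinates the proposal instead enforces strict alignment, so this "child wins" override only applies to coordinates that are not indexes.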
