Hierarchical coordinates in DataTree #9063

shoyer · 2024-06-03T23:30:29Z

Proof of concept implementation of #9056

Still probably needs more tests, suggestions welcome!

It's also possible that I've implemented this in a rather inelegant fashion, I am still wrapping my head around the tree-node abstraction.

CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale

TomNicholas · 2024-06-04T06:01:18Z

This is cool to see! And it shows that simply calling align on the data in parent nodes can be neat way to perform the check for inheritability.

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model. As well as restricting the range of valid DataTree objects, it would actually mean that there are netCDF files that DataTree objects cannot represent:

# using `main`
import xarray as xr
from xarray.core.datatree import DataTree

# create the unalignable tree from your tests
dt = DataTree.from_dict(
    {
        "/": xr.Dataset({"a": (("x",), [1, 2])}),
        "/b": xr.Dataset({"c": (("x",), [3])}),
    }
)

# write a valid netCDF file...
dt.to_netcdf('test.nc')

# switch branch to `datatree-hierarchy`
from xarray.backends.api import open_datatree

dt2 = open_datatree('test.nc')  # will raise an error from align

ValueError: inconsistent alignment between node and parent datasets:

The alternative suggestion I made in #9056 was instead to provide an API which performs inheritance like this, but doesn't apply that restriction to all trees at construction time.

I was working on my own PR in #9056 which does a similar alignment check but only when the user calls a specific .inherit method. The downside of my approach is that the alignment would be checked every time the user calls .inherit, which is really inefficient:

dt.inherit['x']  # calls `align` once
dt.inherit['x']  # calls `align` again!

I think what we actually want is somewhere between the two: we want to calculate all the inheritable coordinates at tree construction time like you've done, but not actually enforce that by raising errors from align. Instead we cache the list of which variables are inheritable (again like you've done), but keep them hidden until we merge them into the node once the user calls .inherit (like I've tried to do).

TomNicholas · 2024-06-04T06:05:29Z

xarray/core/datatree.py

+def _check_alignment(
+ node_ds: Dataset,
+ parent_ds: Dataset | None,
+ children: Mapping[Hashable, DataTree],


Suggested change

def _check_alignment(

node_ds: Dataset,

parent_ds: Dataset | None,

children: Mapping[Hashable, DataTree],

def _find_alignable_coordinates(

node_ds: Dataset,

parent_ds: Dataset | None,

) -> Coordinates:

Then we would store those coordinates on the node at construction time but not use them unless the user calls .inherit.

TomNicholas · 2024-06-20T18:22:24Z

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model.

I withdraw this complaint 😁 - see #9077 for discussion as to whether or not to impose this contraint.

@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

We could hide this detail by having properties like .variables and ._local_variables or similar, so that code which accesses these variables doesn't have to be changed. But then the model is a bit simpler as we're only storing string keys to represent the inherited variables (it's a little more like how coord_names is internally just a set[Hashable]).

The main tricky part I see with this approach is making sure the correct variable is altered when you try to write to an inherited variable, but I think we have to handle that carefully with either approach.

shoyer · 2024-06-20T18:22:37Z

xarray/core/datatree.py

@@ -480,6 +489,7 @@ def to_dataset(self) -> Dataset:
 --------
 DataTree.ds
 """
+ # TODO: copy these container objects?


Note: We definitely need tests to verify the to_dataset() creates an new copy of the node's data. (I suspect that current it does not.)

shoyer · 2024-06-20T18:23:40Z

@TomNicholas CI is passing now, so this should be in a reasonable state for you to take a look!

shoyer · 2024-06-20T18:28:00Z

@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently. Can you think of use-cases for distinguishing node vs inherited data? I guess this could be a nice option for .to_dataset() and for writing data to Zarr?

TomNicholas · 2024-06-20T18:32:11Z

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

shoyer · 2024-06-20T23:15:54Z

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

I agree, it seems pretty clear that we need support for pulling out "node local" versions of datasets, without inheritance.

For this PR, I'll implement a keyword option on DataTree.to_dataset to allow for only pulling out these local variables. This might be spelled local=True or inherited=False, and will ensure that we have the right information in the DataTree data model.

I'll also add a test to verify that to_zarr() works properly, without duplicating coordinates.

shoyer · 2024-06-21T21:24:35Z

I implemented to_dataset(local=True). Would love feedback!

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main. No error is raised when you try to set coordinates, but they don't actually get changed, e.g.,

In [1]: from xarray.core.datatree import DataTree
   ...: import xarray as xr
   ...:
   ...: tree = DataTree.from_dict(
   ...:     {
   ...:         "/": xr.Dataset({"e": (("x",), [1, 2])}, coords={"x": [2, 3]}),
   ...:         "/b": xr.Dataset({"f": (("y",), [3])}),
   ...:         "/b/c": xr.Dataset(),
   ...:         "/b/d": xr.Dataset({"g": 4}),
   ...:     }
   ...: )

In [2]: tree
Out[2]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

In [3]: tree.coords['y'] = 5

In [4]: tree
Out[4]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       *empty*
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

So I'm not quite sure how to test modifying coordinates yet. Maybe we'll leave that for a follow-up PR.

TomNicholas · 2024-06-25T14:53:49Z

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main.

Ah yes - I forgot there's a TODO deep somewhere for that 😅 I left it because it for later because it seemed like it might require changing the DatasetCoordinates class that ds.coords returns. (Surely that class could just be replaced by the new xr.Coordinates class now??)

xarray/core/datatree_io.py

TomNicholas · 2024-06-25T15:09:29Z

What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently.

Hmm, looking again I'm not sure this is true. You have the attributes defining the DataTree node as

class DataTree:
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _variables: dict[Hashable, Variable]
    _coord_names: set[Hashable]
    _dims: dict[Hashable, int]
    _indexes: dict[Hashable, Index]
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

i.e. every local variable (and index) are explicitly stored on every node, and every inherited variable (and index) are also stored explicitly on every node. What I was suggesting was something more like

class DataTree
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]  # these 4 attributes we already have, they're  just renamed
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _inherited_coord_names: set[Hashable]  # note this is now the only new attribute needed compared to current non-inheriting model
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

    @property
    def _inherited_coords(self) -> dict[Hashable, Variable]:
        # we know these exist somewhere further up the tree
        return {self._get_inherited_coord_from_above(coord_name) for coord_name in self._inherited_coord_names}

    @property
    def variables(self) -> dict[Hashable, Variable]:
        # we know these are all alignable because we checked it at construction time
        all_variables = {**self._local_variables, **self._inherited_coords}
        return all_variables

    @property
    def _dims(self) -> dict[Hashable, int]:
        """All dims present on local and inherited variables"""
        return ...

where the set of all variables if defined as a property that looks up the tree to fetch the correct inherited variables (so they aren't stored twice).

Basically the point is that the only extra thing I need to know is the names of the variables to inherit, and because I know they definitely exist, I can just generate any combination of local/inherited/all variables dynamically. With this design there are more properties that get built on-the-fly, but there are fewer internal attributes that must be kept consistent with one another.

TomNicholas · 2024-06-28T16:33:49Z

OK, so here's the tricky case: How should we handle checking alignment when there is an inherited dimension that is not found on a direct parent dataset?

Ah, now I realise that this is probably why my grandchild tests in #9187 don't yet work...

This does seem tricky. IIUC the basic problem is that _check_alignment currently checks that local variables align with the local variables in the immediate parent node, then recursively calls itself again to check that variables on each child node align with those on the current node. At no point does it do the alignment check between two nodes that are not immediately adjacent, so if alignment information is not propagated, the chain is broken.

Other ideas for solving this:

Check alignment between every node and every other node - would work, but would be combinatorially expensive so clearly a terrible idea.
Check alignment between local variables and all the variables on the parent, including inherited variables. Inheritance will effectively populate those empty nodes, solving the issue. However if there the alignment check raises it then becomes really hard to generate a diff showing where the offending variables were defined. (Also not totally sure that this isn't circular logic...)

TomNicholas · 2024-06-29T20:23:49Z

How should we handle checking alignment when there is an inherited dimension that is not found on a direct parent dataset?

Thinking about this more I have two questions:

Are we checking alignment against all variables (data and coordinate), but only propagating down coordinate variables? Couldn't that lead to weird situations where a child dataset that aligns with a grandparent's coordinate variables but not with its grandparent's data variables doesn't raise an error? (because the grandparents' data variables aren't inherited down to the empty parent node). I guess what I'm getting at here is being clearer on the difference between inheriting all variables to do alignment checks, and inheriting just coordinate variables in order to make them accessible from child groups.
Does it even make sense for _check_alignment to raise a ValueError of the form
"group {path1} is not aligned with its parent: see {repr(group1)} vs {repr(group2)}"? The alignment criterion is that the current node aligns with the parent node including all coordinates inherited onto that parent node. But those inherited coordinates could have come from multiple different ancestor groups. So a single alignment error message could actually have happened due to variables in two different ancestor groups. For example:

tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"x": (("x",), [1, 2])}),
        "/b": xr.Dataset({"y": (("y",), [1])}),
        "/b/c": xr.Dataset({"d": (("x", "y"), [[3, 4]])}),  # d has size {'x': 1, 'y': 2}, so doesn't align with either parent group.
    }
)

Here "d" doesn't align with "x" or "y", but you can't really say that the alignment error is caused by group "/b" misaligning with group "b/c" or by group "/"misaligning with group"/b/c"`, it's caused by both! It would make much more sense to say that group "/b/c" does not align with its parent group including all variables inherited onto the parent.

shoyer · 2024-06-30T22:18:56Z

The Variable objects in _coord_variables will still be pointed to from multiple places

I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .

Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names. Using a separate dictionary for coord_variables would also be more aligned with how DataArray is implemented. The internal Dataset data model is a hold-over from the very early days of Xarray, before we had a notion of coordinate variables that are not indexes.

The ChainMap of coordinates is potentially storing irrelevant information, i.e. anything that gets overridden. Specifically
dt = DataTree.from_dict(
    {
        "/a": xr.Dataset({"x": (("x",), [1, 2])}),  # x coordinate of size 2
        "/a/b": xr.Dataset({"x": (("x",), [3, 4])}),  # x would be inherited, but gets overridden (as this new x coordinate still aligns)
    }
)
Because a ChainMap doesn't overwrite keys, just obscures them, the object dt['/a/b'] now stores a ChainMap which has a reference to the coordinate ('x', [1, 2]). This is despite us knowing at tree construction time that this original x coordinate will never be used or even accessible by the dt['/a/b'] node.

This particular example would actually be an alignment error (x is used as an index), but you are correct that this sort of scenario could (rarely) happen. Specifically, it could happen for coordinate variables that are not indexes, e.g.,

dt = DataTree.from_dict(
    {
        "/a": xr.Dataset(coords={"y": (("x",), [1, 2])}),
        "/a/b": xr.Dataset(coords={"y": (("x",), [3, 4])}),  # y would be overridden
    }
)

I don't really see this as a problem, though. It's true that the higher-level coordinates are not currently accessible, but they are still around if the coordinate on the lower-level group is modified.

I guess I still don't really understand why we are trying to implement this inheritance by storing all the things that might have been inherited (or might have been overridden) in attributes on the local node. The ChainMap seems like a clever way to avoid iterateing upwards through .parents, but I'm not sure that means that every node should store a ChainMap.

On the contrary, basically all ChainMap does is systematize iterating upwards through .parents! I highly recommend taking a look under the hood of ChainMap -- it's quite simple.

Store every coordinate variable only once, keeping only the names of the alignable variables on the child node. Then we iterate up the tree to go fetch the variable only when we need it.

To be clear, my intention was quite similar -- the underlying (node-level) maps at each level of _coord_variables, _dims and _indexes should each exist exactly once. The ChainMap objects are just proxies that link them together.

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure, e.g.,

@property
def _chained_coord_variables(self):
    maps = [self._coord_variables]
    maps.extend(p._coord_variables for p in parents)
    return ChainMap(*maps)

However, it turns out we can create Dataset objects with extra unused dimensions (via direct constructors), and everything works in align() as you would expect.
My current plan is to do just that: create technically invalid Dataset objects for purposes of checking alignment

This also seems potentially fragile? 😕

It's a little fragile, but these objects never get exposed to users, so the risk is quite contained.

shoyer · 2024-06-30T22:23:42Z

Check alignment between local variables and all the variables on the parent, including inherited variables. Inheritance will effectively populate those empty nodes, solving the issue. However if there the alignment check raises it then becomes really hard to generate a diff showing where the offending variables were defined. (Also not totally sure that this isn't circular logic...)

This is the solution I have currently implemented. I think it's OK -- the inherited coordinate variables are listed on the repr of the parent node -- though indeed it's not as clear as it could be. Ideally we would say exactly at which level all the conflicting coordinates came from.

shoyer · 2024-06-30T22:32:38Z

Are we checking alignment against all variables (data and coordinate), but only propagating down coordinate variables?

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

Does it even make sense for _check_alignment to raise a ValueError of the form
"group {path1} is not aligned with its parent: see {repr(group1)} vs {repr(group2)}"? The alignment criterion is that the current node aligns with the parent node including all coordinates inherited onto that parent node. But those inherited coordinates could have come from multiple different ancestor groups.

Note that the repr for the parent includes inherited variables.

So a single alignment error message could actually have happened due to variables in two different ancestor groups. For example:
tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"x": (("x",), [1, 2])}),
        "/b": xr.Dataset({"y": (("y",), [1])}),
        "/b/c": xr.Dataset({"d": (("x", "y"), [[3, 4]])}),  # d has size {'x': 1, 'y': 2}, so doesn't align with either parent group.
    }
)
Here "d" doesn't align with "x" or "y", but you can't really say that the alignment error is caused by group "/b" misaligning with group "b/c" or by group "/"misaligning with group"/b/c"`, it's caused by both! It would make much more sense to say that group "/b/c" does not align with its parent group including all variables inherited onto the parent.

The current error message (from evaluating your snippet on this branch) is:

ValueError: group '/b/c' is not aligned with its parent:
Group:
    Dimensions:  (x: 1, y: 2)
    Dimensions without coordinates: x, y
    Data variables:
        d        (x, y) int64 16B 3 4
Parent:
    Dimensions:  (x: 2, y: 1)
    Coordinates:
      * x        (x) int64 16B 1 2
      * y        (y) int64 8B 1

I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

shoyer · 2024-06-30T22:20:52Z

xarray/core/datatree.py

+ # Note: rebuild_dims=False can create technically invalid Dataset
+ # objects because it may not contain all dimensions on its direct
+ # member variables, e.g., consider:
+ # tree = DataTree.from_dict(
+ # {
+ # "/": xr.Dataset({"a": (("x",), [1, 2])}), # x has size 2
+ # "/b/c": xr.Dataset({"d": (("x",), [3])}), # x has size1
+ # }
+ # )
+ # However, they are fine for internal use cases, for align() or
+ # building a repr().
+ dims = dict(self._dims)


@TomNicholas This is where we make the invalid Dataset objects, though they are also propagated by _inherited_dataset() (if one of the inputs is invalid).

shoyer · 2024-06-30T22:46:01Z

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure, e.g.,
@property
def _chained_coord_variables(self):
    maps = [self._coord_variables]
    maps.extend(p._coord_variables for p in parents)
    return ChainMap(*maps)

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

TomNicholas

Thanks for the thorough reply here @shoyer ! I think my main concerns are now addressed. 🙏

I'm going to approve, though I think there are a couple of little nits that could be neater. I would also love it if @owenlittlejohns / @flamingbear / @eni-awowale could at least take a look and say if they are following the general gist of what's being done here 😁

The Variable objects in _coord_variables will still be pointed to from multiple places

I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .

👍 I think my main concern here was really about having two different representations of the tree at once, and that's dealt with now.

Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names.

Interesting! I've split out #9203 to track this idea.

The ChainMap objects are just proxies that link them together.

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

Thank you - I think this makes the code a lot easier to understand, both in terms of how the coordinates that go into constructing .ds are found, and in terms of the internal attributes of a single DataTree node being simpler.

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

Okay great. We should make this super clear in the documentation.

Note that the repr for the parent includes inherited variables.

[this error message when grandparents don't align] I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

👍 let's try to indicate that the group represent coordinates from the entire tree above.

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

xarray/core/datatree.py

xarray/core/treenode.py

flamingbear · 2024-07-01T21:02:17Z

say if they are following the general gist of what's being done here 😁

I have been following along, I may be a bit behind. I do like the properties for _coord_variables, _dims and _indexes, I think that captures what Tom was going for in a really clean way.

~~I have at least one dumb question I will ask inline.~~ Usually I have to actually ask the question before I figure it out.

shoyer · 2024-07-01T22:46:02Z

👍 let's try to indicate that the group represent coordinates from the entire tree above.

Here's what you currently see, which I think is pretty clear:

ValueError: group '/b' is not aligned with its parents:
Group:
    Dimensions:  (x: 1)
    Dimensions without coordinates: x
    Data variables:
        c        (x) float64 8B 3.0
From parents:
    Dimensions:  (x: 2)
    Dimensions without coordinates: x

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

I don't love this either but I think it's an expedient choice for now. DatasetView is also problematic for similar reasons -- maybe even worse because it's exposed to users.

flamingbear

I've followed this though the changes and agree with the decisions, and that we should capture some of the newer tweaks.

keewis

I'm still somewhat uncomfortable with the idea of pulling out inherited coordinates by default when converting a node to a dataset – I would probably have kept the inheritance to DataTree objects by default. I guess my concern is that

xr.open_dataset(..., group="/a")

results in a dataset that is different from either of

xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

which may be pretty confusing at first. This can probably be avoided by changing the repr to mark inherited variables as such (not sure how easy that would be, though, especially since at least the conversion to Dataset would lose that information).

However, other than that this looks good to me, and I don't think this concern is urgent enough to block this PR (and in any case I do realize I'm somewhat late in voicing this concern).

xarray/core/datatree.py

shoyer · 2024-07-03T00:04:36Z

I guess my concern is that
xr.open_dataset(..., group="/a")
results in a dataset that is different from either of
xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

We are going to have a similar issue when we add group to open_datatree, too: open_datatree(..., group="/a") will be different from open_datatree(...)["/a"].

One way to solve this would be to add an inherited argument to open_dataset and open_datatree, but I'm not sure anybody would actually use it. It's probably better to just to document the discrepancy for now.

In practice, setting the group argument to either open_dataset or open_datatree is something I would expect users to will only do for very large files, in performance critical cases. For these power users, I'm OK with a few more surprising edge cases

TomNicholas · 2024-07-03T00:05:17Z

@keewis that's a good point, though I'm not sure what we can do about it other than document the subtlety clearly.

Note that the proposed

xr.open_as_dict_of_datasets(...)["/a"]

in #9137 would behave the way you are expecting, i.e. the same as

xr.open_dataset(..., group="/a")

shoyer · 2024-07-03T00:36:07Z

Thanks everyone for your reviews and feedback, especially @TomNicholas ! This definitely improved the final design.

I'm merging this for now so we can start testing and iterating upon it.

Inheritance of data coordinates

95dbece

TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jun 4, 2024

TomNicholas added this to In progress in DataTree integration via automation Jun 4, 2024

TomNicholas mentioned this pull request Jun 4, 2024

Sketch of datatree.inherit method #9065

Closed

4 tasks

TomNicholas reviewed Jun 4, 2024

View reviewed changes

shoyer mentioned this pull request Jun 7, 2024

Coordinate inheritance for xarray.DataTree #9077

Open

shoyer added 6 commits June 12, 2024 17:11

Simplify __init__

abf0574

Merge branch 'main' into datatree-hierarchy

dfed099

Include path name in alignment errors

54364d0

Fix some mypy errors

f72fcd5

mypy fix

c4722e0

simplify DataTree data model

c08e810

shoyer commented Jun 20, 2024

View reviewed changes

shoyer added 2 commits June 20, 2024 18:37

Add to_dataset(local=True)

2351b97

Fix mypy failure in tests

03fb991

Fix to_zarr for inherited coords

a611fb1

flamingbear reviewed Jun 25, 2024

View reviewed changes

xarray/core/datatree_io.py Outdated Show resolved Hide resolved

shoyer added 2 commits June 25, 2024 16:14

Fix to_netcdf for heirarchical coords

cef6cfa

Add ChainSet

51b2c43

owenlittlejohns mentioned this pull request Jun 26, 2024

DAS-2155 - Merge datatree documentation into main docs. #9033

Open

3 tasks

Revise internal data model; remove ChainSet

dcb103b

shoyer commented Jun 30, 2024

View reviewed changes

Construct ChainMap objects on demand.

6595e2f

shoyer added 3 commits June 30, 2024 15:46

Merge branch 'main' into datatree-hierarchy

ebeec84

slightly better error message with mis-aligned data trees

88e5f4a

mypy fix

1f6c7b4

shoyer mentioned this pull request Jul 1, 2024

Add copy option in DataTree.from_dict #9193

Open

2 tasks

use float64 instead of float32 for windows

7bb5d86

TomNicholas mentioned this pull request Jul 1, 2024

Refactor Dataset internals to store data variables and coordinate variables as separate dicts #9203

Open

TomNicholas approved these changes Jul 1, 2024

View reviewed changes

TomNicholas reviewed Jul 1, 2024

View reviewed changes

clean-up per review

57e598e

shoyer mentioned this pull request Jul 1, 2024

DataTree.coords.__setitem__ is broken #9204

Open

flamingbear approved these changes Jul 2, 2024

View reviewed changes

keewis approved these changes Jul 2, 2024

View reviewed changes

xarray/core/datatree.py Show resolved Hide resolved

Add note about inheritance to .ds docs

6cfd8ff

shoyer merged commit a86c3ff into pydata:main Jul 3, 2024
28 checks passed

DataTree integration automation moved this from In progress to Done Jul 3, 2024

shoyer deleted the datatree-hierarchy branch July 7, 2024 21:59

TomNicholas mentioned this pull request Jul 9, 2024

Track merging datatree into xarray #8572

Open

27 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hierarchical coordinates in DataTree #9063

Hierarchical coordinates in DataTree #9063

shoyer commented Jun 3, 2024

TomNicholas commented Jun 4, 2024

TomNicholas Jun 4, 2024

TomNicholas commented Jun 20, 2024

shoyer Jun 20, 2024

shoyer commented Jun 20, 2024

shoyer commented Jun 20, 2024

TomNicholas commented Jun 20, 2024

shoyer commented Jun 20, 2024

shoyer commented Jun 21, 2024

TomNicholas commented Jun 25, 2024

TomNicholas commented Jun 25, 2024 •

edited

Loading

TomNicholas commented Jun 28, 2024

TomNicholas commented Jun 29, 2024 •

edited

Loading

shoyer commented Jun 30, 2024

shoyer commented Jun 30, 2024

shoyer commented Jun 30, 2024

shoyer Jun 30, 2024

shoyer commented Jun 30, 2024 •

edited

Loading

TomNicholas left a comment

flamingbear commented Jul 1, 2024 •

edited

Loading

shoyer commented Jul 1, 2024

flamingbear left a comment

keewis left a comment •

edited

Loading

shoyer commented Jul 3, 2024

TomNicholas commented Jul 3, 2024 •

edited

Loading

shoyer commented Jul 3, 2024

Hierarchical coordinates in DataTree #9063

Hierarchical coordinates in DataTree #9063

Conversation

shoyer commented Jun 3, 2024

TomNicholas commented Jun 4, 2024

TomNicholas Jun 4, 2024

Choose a reason for hiding this comment

TomNicholas commented Jun 20, 2024

shoyer Jun 20, 2024

Choose a reason for hiding this comment

shoyer commented Jun 20, 2024

shoyer commented Jun 20, 2024

TomNicholas commented Jun 20, 2024

shoyer commented Jun 20, 2024

shoyer commented Jun 21, 2024

TomNicholas commented Jun 25, 2024

TomNicholas commented Jun 25, 2024 • edited Loading

TomNicholas commented Jun 28, 2024

TomNicholas commented Jun 29, 2024 • edited Loading

shoyer commented Jun 30, 2024

shoyer commented Jun 30, 2024

shoyer commented Jun 30, 2024

shoyer Jun 30, 2024

Choose a reason for hiding this comment

shoyer commented Jun 30, 2024 • edited Loading

TomNicholas left a comment

Choose a reason for hiding this comment

flamingbear commented Jul 1, 2024 • edited Loading

shoyer commented Jul 1, 2024

flamingbear left a comment

Choose a reason for hiding this comment

keewis left a comment • edited Loading

Choose a reason for hiding this comment

shoyer commented Jul 3, 2024

TomNicholas commented Jul 3, 2024 • edited Loading

shoyer commented Jul 3, 2024

TomNicholas commented Jun 25, 2024 •

edited

Loading

TomNicholas commented Jun 29, 2024 •

edited

Loading

shoyer commented Jun 30, 2024 •

edited

Loading

flamingbear commented Jul 1, 2024 •

edited

Loading

keewis left a comment •

edited

Loading

TomNicholas commented Jul 3, 2024 •

edited

Loading