Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hierarchical coordinates in DataTree #9063

Merged
merged 25 commits into from
Jul 3, 2024
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jun 3, 2024

Proof of concept implementation of #9056

Still probably needs more tests, suggestions welcome!

It's also possible that I've implemented this in a rather inelegant fashion, I am still wrapping my head around the tree-node abstraction.

CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jun 4, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation Jun 4, 2024
@TomNicholas
Copy link
Contributor

This is cool to see! And it shows that simply calling align on the data in parent nodes can be neat way to perform the check for inheritability.

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model. As well as restricting the range of valid DataTree objects, it would actually mean that there are netCDF files that DataTree objects cannot represent:

# using `main`
import xarray as xr
from xarray.core.datatree import DataTree

# create the unalignable tree from your tests
dt = DataTree.from_dict(
    {
        "/": xr.Dataset({"a": (("x",), [1, 2])}),
        "/b": xr.Dataset({"c": (("x",), [3])}),
    }
)

# write a valid netCDF file...
dt.to_netcdf('test.nc')

# switch branch to `datatree-hierarchy`
from xarray.backends.api import open_datatree

dt2 = open_datatree('test.nc')  # will raise an error from align
ValueError: inconsistent alignment between node and parent datasets:

The alternative suggestion I made in #9056 was instead to provide an API which performs inheritance like this, but doesn't apply that restriction to all trees at construction time.

I was working on my own PR in #9056 which does a similar alignment check but only when the user calls a specific .inherit method. The downside of my approach is that the alignment would be checked every time the user calls .inherit, which is really inefficient:

dt.inherit['x']  # calls `align` once
dt.inherit['x']  # calls `align` again!

I think what we actually want is somewhere between the two: we want to calculate all the inheritable coordinates at tree construction time like you've done, but not actually enforce that by raising errors from align. Instead we cache the list of which variables are inheritable (again like you've done), but keep them hidden until we merge them into the node once the user calls .inherit (like I've tried to do).

Comment on lines 107 to 110
def _check_alignment(
node_ds: Dataset,
parent_ds: Dataset | None,
children: Mapping[Hashable, DataTree],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _check_alignment(
node_ds: Dataset,
parent_ds: Dataset | None,
children: Mapping[Hashable, DataTree],
def _find_alignable_coordinates(
node_ds: Dataset,
parent_ds: Dataset | None,
) -> Coordinates:

Then we would store those coordinates on the node at construction time but not use them unless the user calls .inherit.

@TomNicholas
Copy link
Contributor

However this PR doesn't actually implement the suggestion in #9056, because it places a new strong restriction on the datatree data model.

I withdraw this complaint 😁 - see #9077 for discussion as to whether or not to impose this contraint.


@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

We could hide this detail by having properties like .variables and ._local_variables or similar, so that code which accesses these variables doesn't have to be changed. But then the model is a bit simpler as we're only storing string keys to represent the inherited variables (it's a little more like how coord_names is internally just a set[Hashable]).

The main tricky part I see with this approach is making sure the correct variable is altered when you try to write to an inherited variable, but I think we have to handle that carefully with either approach.

@@ -480,6 +489,7 @@ def to_dataset(self) -> Dataset:
--------
DataTree.ds
"""
# TODO: copy these container objects?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: We definitely need tests to verify the to_dataset() creates an new copy of the node's data. (I suspect that current it does not.)

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

@TomNicholas CI is passing now, so this should be in a reasonable state for you to take a look!

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

@shoyer this suggestion might make no sense, but IIUC currently in this PR inherited variables & indexes are literally copied onto child nodes, relying on xarray's shallow copying features to not duplicate any underlying data. What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently. Can you think of use-cases for distinguishing node vs inherited data? I guess this could be a nice option for .to_dataset() and for writing data to Zarr?

@TomNicholas
Copy link
Contributor

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

@shoyer
Copy link
Member Author

shoyer commented Jun 20, 2024

Oh right! I think we might need to be able to distinguish internally when doing map_over_subtree

I agree, it seems pretty clear that we need support for pulling out "node local" versions of datasets, without inheritance.

For this PR, I'll implement a keyword option on DataTree.to_dataset to allow for only pulling out these local variables. This might be spelled local=True or inherited=False, and will ensure that we have the right information in the DataTree data model.

I'll also add a test to verify that to_zarr() works properly, without duplicating coordinates.

@shoyer
Copy link
Member Author

shoyer commented Jun 21, 2024

I implemented to_dataset(local=True). Would love feedback!

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main. No error is raised when you try to set coordinates, but they don't actually get changed, e.g.,

In [1]: from xarray.core.datatree import DataTree
   ...: import xarray as xr
   ...:
   ...: tree = DataTree.from_dict(
   ...:     {
   ...:         "/": xr.Dataset({"e": (("x",), [1, 2])}, coords={"x": [2, 3]}),
   ...:         "/b": xr.Dataset({"f": (("y",), [3])}),
   ...:         "/b/c": xr.Dataset(),
   ...:         "/b/d": xr.Dataset({"g": 4}),
   ...:     }
   ...: )

In [2]: tree
Out[2]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

In [3]: tree.coords['y'] = 5

In [4]: tree
Out[4]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       *empty*
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

So I'm not quite sure how to test modifying coordinates yet. Maybe we'll leave that for a follow-up PR.

@TomNicholas
Copy link
Contributor

One messy thing is that it appears that assignment via .coords is broken on DataTree even at main.

Ah yes - I forgot there's a TODO deep somewhere for that 😅 I left it because it for later because it seemed like it might require changing the DatasetCoordinates class that ds.coords returns. (Surely that class could just be replaced by the new xr.Coordinates class now??)

@TomNicholas
Copy link
Contributor

TomNicholas commented Jun 25, 2024

What I wonder is whether instead we could simply store the list of names of the inherited variables on the child node as ._inherited_var_names: set[str], and anytime we want access to them we simply look back up the tree until we find it. With a strictly-aligned model we can be sure that they definitely do exist further up the tree.

Indeed, this is how I implemented it currently.

Hmm, looking again I'm not sure this is true. You have the attributes defining the DataTree node as

class DataTree:
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _variables: dict[Hashable, Variable]
    _coord_names: set[Hashable]
    _dims: dict[Hashable, int]
    _indexes: dict[Hashable, Index]
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

i.e. every local variable (and index) are explicitly stored on every node, and every inherited variable (and index) are also stored explicitly on every node. What I was suggesting was something more like

class DataTree
    _name: str | None
    _parent: DataTree | None
    _children: dict[str, DataTree]
    _cache: dict[str, Any]  # used by _CachedAccessor
    _local_variables: dict[Hashable, Variable]  # these 4 attributes we already have, they're  just renamed
    _local_coord_names: set[Hashable]
    _local_dims: dict[Hashable, int]
    _local_indexes: dict[Hashable, Index]
    _inherited_coord_names: set[Hashable]  # note this is now the only new attribute needed compared to current non-inheriting model
    _attrs: dict[Hashable, Any] | None
    _encoding: dict[Hashable, Any] | None
    _close: Callable[[], None] | None

    @property
    def _inherited_coords(self) -> dict[Hashable, Variable]:
        # we know these exist somewhere further up the tree
        return {self._get_inherited_coord_from_above(coord_name) for coord_name in self._inherited_coord_names}

    @property
    def variables(self) -> dict[Hashable, Variable]:
        # we know these are all alignable because we checked it at construction time
        all_variables = {**self._local_variables, **self._inherited_coords}
        return all_variables

    @property
    def _dims(self) -> dict[Hashable, int]:
        """All dims present on local and inherited variables"""
        return ...

where the set of all variables if defined as a property that looks up the tree to fetch the correct inherited variables (so they aren't stored twice).

Basically the point is that the only extra thing I need to know is the names of the variables to inherit, and because I know they definitely exist, I can just generate any combination of local/inherited/all variables dynamically. With this design there are more properties that get built on-the-fly, but there are fewer internal attributes that must be kept consistent with one another.

@TomNicholas
Copy link
Contributor

OK, so here's the tricky case: How should we handle checking alignment when there is an inherited dimension that is not found on a direct parent dataset?

Ah, now I realise that this is probably why my grandchild tests in #9187 don't yet work...

This does seem tricky. IIUC the basic problem is that _check_alignment currently checks that local variables align with the local variables in the immediate parent node, then recursively calls itself again to check that variables on each child node align with those on the current node. At no point does it do the alignment check between two nodes that are not immediately adjacent, so if alignment information is not propagated, the chain is broken.

Other ideas for solving this:

  1. Check alignment between every node and every other node - would work, but would be combinatorially expensive so clearly a terrible idea.
  2. Check alignment between local variables and all the variables on the parent, including inherited variables. Inheritance will effectively populate those empty nodes, solving the issue. However if there the alignment check raises it then becomes really hard to generate a diff showing where the offending variables were defined. (Also not totally sure that this isn't circular logic...)

@TomNicholas
Copy link
Contributor

TomNicholas commented Jun 29, 2024

How should we handle checking alignment when there is an inherited dimension that is not found on a direct parent dataset?

Thinking about this more I have two questions:

  1. Are we checking alignment against all variables (data and coordinate), but only propagating down coordinate variables? Couldn't that lead to weird situations where a child dataset that aligns with a grandparent's coordinate variables but not with its grandparent's data variables doesn't raise an error? (because the grandparents' data variables aren't inherited down to the empty parent node). I guess what I'm getting at here is being clearer on the difference between inheriting all variables to do alignment checks, and inheriting just coordinate variables in order to make them accessible from child groups.

  2. Does it even make sense for _check_alignment to raise a ValueError of the form
    "group {path1} is not aligned with its parent: see {repr(group1)} vs {repr(group2)}"? The alignment criterion is that the current node aligns with the parent node including all coordinates inherited onto that parent node. But those inherited coordinates could have come from multiple different ancestor groups. So a single alignment error message could actually have happened due to variables in two different ancestor groups. For example:

tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"x": (("x",), [1, 2])}),
        "/b": xr.Dataset({"y": (("y",), [1])}),
        "/b/c": xr.Dataset({"d": (("x", "y"), [[3, 4]])}),  # d has size {'x': 1, 'y': 2}, so doesn't align with either parent group.
    }
)

Here "d" doesn't align with "x" or "y", but you can't really say that the alignment error is caused by group "/b" misaligning with group "b/c" or by group "/"misaligning with group"/b/c"`, it's caused by both! It would make much more sense to say that group "/b/c" does not align with its parent group including all variables inherited onto the parent.

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

  • The Variable objects in _coord_variables will still be pointed to from multiple places

I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .

  • Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names. Using a separate dictionary for coord_variables would also be more aligned with how DataArray is implemented. The internal Dataset data model is a hold-over from the very early days of Xarray, before we had a notion of coordinate variables that are not indexes.

  • The ChainMap of coordinates is potentially storing irrelevant information, i.e. anything that gets overridden. Specifically
dt = DataTree.from_dict(
    {
        "/a": xr.Dataset({"x": (("x",), [1, 2])}),  # x coordinate of size 2
        "/a/b": xr.Dataset({"x": (("x",), [3, 4])}),  # x would be inherited, but gets overridden (as this new x coordinate still aligns)
    }
)

Because a ChainMap doesn't overwrite keys, just obscures them, the object dt['/a/b'] now stores a ChainMap which has a reference to the coordinate ('x', [1, 2]). This is despite us knowing at tree construction time that this original x coordinate will never be used or even accessible by the dt['/a/b'] node.

This particular example would actually be an alignment error (x is used as an index), but you are correct that this sort of scenario could (rarely) happen. Specifically, it could happen for coordinate variables that are not indexes, e.g.,

dt = DataTree.from_dict(
    {
        "/a": xr.Dataset(coords={"y": (("x",), [1, 2])}),
        "/a/b": xr.Dataset(coords={"y": (("x",), [3, 4])}),  # y would be overridden
    }
)

I don't really see this as a problem, though. It's true that the higher-level coordinates are not currently accessible, but they are still around if the coordinate on the lower-level group is modified.

I guess I still don't really understand why we are trying to implement this inheritance by storing all the things that might have been inherited (or might have been overridden) in attributes on the local node. The ChainMap seems like a clever way to avoid iterateing upwards through .parents, but I'm not sure that means that every node should store a ChainMap.

On the contrary, basically all ChainMap does is systematize iterating upwards through .parents! I highly recommend taking a look under the hood of ChainMap -- it's quite simple.

Store every coordinate variable only once, keeping only the names of the alignable variables on the child node. Then we iterate up the tree to go fetch the variable only when we need it.

To be clear, my intention was quite similar -- the underlying (node-level) maps at each level of _coord_variables, _dims and _indexes should each exist exactly once. The ChainMap objects are just proxies that link them together.

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure, e.g.,

@property
def _chained_coord_variables(self):
    maps = [self._coord_variables]
    maps.extend(p._coord_variables for p in parents)
    return ChainMap(*maps)

However, it turns out we can create Dataset objects with extra unused dimensions (via direct constructors), and everything works in align() as you would expect.
My current plan is to do just that: create technically invalid Dataset objects for purposes of checking alignment

This also seems potentially fragile? 😕

It's a little fragile, but these objects never get exposed to users, so the risk is quite contained.

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

  1. Check alignment between local variables and all the variables on the parent, including inherited variables. Inheritance will effectively populate those empty nodes, solving the issue. However if there the alignment check raises it then becomes really hard to generate a diff showing where the offending variables were defined. (Also not totally sure that this isn't circular logic...)

This is the solution I have currently implemented. I think it's OK -- the inherited coordinate variables are listed on the repr of the parent node -- though indeed it's not as clear as it could be. Ideally we would say exactly at which level all the conflicting coordinates came from.

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

  1. Are we checking alignment against all variables (data and coordinate), but only propagating down coordinate variables?

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

  1. Does it even make sense for _check_alignment to raise a ValueError of the form
    "group {path1} is not aligned with its parent: see {repr(group1)} vs {repr(group2)}"? The alignment criterion is that the current node aligns with the parent node including all coordinates inherited onto that parent node. But those inherited coordinates could have come from multiple different ancestor groups.

Note that the repr for the parent includes inherited variables.

So a single alignment error message could actually have happened due to variables in two different ancestor groups. For example:

tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"x": (("x",), [1, 2])}),
        "/b": xr.Dataset({"y": (("y",), [1])}),
        "/b/c": xr.Dataset({"d": (("x", "y"), [[3, 4]])}),  # d has size {'x': 1, 'y': 2}, so doesn't align with either parent group.
    }
)

Here "d" doesn't align with "x" or "y", but you can't really say that the alignment error is caused by group "/b" misaligning with group "b/c" or by group "/"misaligning with group"/b/c"`, it's caused by both! It would make much more sense to say that group "/b/c" does not align with its parent group including all variables inherited onto the parent.

The current error message (from evaluating your snippet on this branch) is:

ValueError: group '/b/c' is not aligned with its parent:
Group:
    Dimensions:  (x: 1, y: 2)
    Dimensions without coordinates: x, y
    Data variables:
        d        (x, y) int64 16B 3 4
Parent:
    Dimensions:  (x: 2, y: 1)
    Coordinates:
      * x        (x) int64 16B 1 2
      * y        (y) int64 8B 1

I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

Comment on lines +535 to +546
# Note: rebuild_dims=False can create technically invalid Dataset
# objects because it may not contain all dimensions on its direct
# member variables, e.g., consider:
# tree = DataTree.from_dict(
# {
# "/": xr.Dataset({"a": (("x",), [1, 2])}), # x has size 2
# "/b/c": xr.Dataset({"d": (("x",), [3])}), # x has size1
# }
# )
# However, they are fine for internal use cases, for align() or
# building a repr().
dims = dict(self._dims)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomNicholas This is where we make the invalid Dataset objects, though they are also propagated by _inherited_dataset() (if one of the inputs is invalid).

@shoyer
Copy link
Member Author

shoyer commented Jun 30, 2024

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure, e.g.,

@property
def _chained_coord_variables(self):
    maps = [self._coord_variables]
    maps.extend(p._coord_variables for p in parents)
    return ChainMap(*maps)

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

Copy link
Contributor

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough reply here @shoyer ! I think my main concerns are now addressed. 🙏

I'm going to approve, though I think there are a couple of little nits that could be neater. I would also love it if @owenlittlejohns / @flamingbear / @eni-awowale could at least take a look and say if they are following the general gist of what's being done here 😁

The Variable objects in _coord_variables will still be pointed to from multiple places

I see this as a pro, rather than a con. Using the exact same Variable objects means it is possible to modify metadata like attrs in-place .

👍 I think my main concern here was really about having two different representations of the tree at once, and that's dealt with now.

Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names.

Interesting! I've split out #9203 to track this idea.

The ChainMap objects are just proxies that link them together.

That said, there's no reason why the ChainMap objects need to be constructed when the DataTree objects are built. Instead, it probably makes more sense to do this dynamically, to avoid a separate redundant representation of the tree structure

This is now implemented -- what do you think?

I think this captures of the essence of what you were aiming towards with #9187. Conveniently, it also simplifies the implementation of DataTree.

Thank you - I think this makes the code a lot easier to understand, both in terms of how the coordinates that go into constructing .ds are found, and in terms of the internal attributes of a single DataTree node being simpler.

We're checking alignment against inherited dimensions and indexes, which exactly what xarray.align() checks -- it doesn't worry about data variables at all, or coordinate variables that are not also stored in indexes.

Okay great. We should make this super clear in the documentation.

Note that the repr for the parent includes inherited variables.

[this error message when grandparents don't align] I think is pretty descriptive, though perhaps we could more clearly indicate that it includes all parent nodes, even just by switching "parent" to "parents".

👍 let's try to indicate that the group represent coordinates from the entire tree above.

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/datatree.py Show resolved Hide resolved
xarray/core/treenode.py Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
@flamingbear
Copy link
Contributor

flamingbear commented Jul 1, 2024

say if they are following the general gist of what's being done here 😁

I have been following along, I may be a bit behind. I do like the properties for _coord_variables, _dims and _indexes, I think that captures what Tom was going for in a really clean way.

I have at least one dumb question I will ask inline. Usually I have to actually ask the question before I figure it out.

@shoyer
Copy link
Member Author

shoyer commented Jul 1, 2024

👍 let's try to indicate that the group represent coordinates from the entire tree above.

Here's what you currently see, which I think is pretty clear:

ValueError: group '/b' is not aligned with its parents:
Group:
    Dimensions:  (x: 1)
    Dimensions without coordinates: x
    Data variables:
        c        (x) float64 8B 3.0
From parents:
    Dimensions:  (x: 2)
    Dimensions without coordinates: x

[having intermediate datasets that are technically invalid is] a little fragile, but these objects never get exposed to users, so the risk is quite contained.

This still makes me a little worried - I would rather have structures that always obey their own contracts by design, but I agree its a problem that we can completely isolate from the users.

I don't love this either but I think it's an expedient choice for now. DatasetView is also problematic for similar reasons -- maybe even worse because it's exposed to users.

Copy link
Contributor

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've followed this though the changes and agree with the decisions, and that we should capture some of the newer tweaks.

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still somewhat uncomfortable with the idea of pulling out inherited coordinates by default when converting a node to a dataset – I would probably have kept the inheritance to DataTree objects by default. I guess my concern is that

xr.open_dataset(..., group="/a")

results in a dataset that is different from either of

xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

which may be pretty confusing at first. This can probably be avoided by changing the repr to mark inherited variables as such (not sure how easy that would be, though, especially since at least the conversion to Dataset would lose that information).

However, other than that this looks good to me, and I don't think this concern is urgent enough to block this PR (and in any case I do realize I'm somewhat late in voicing this concern).

xarray/core/datatree.py Show resolved Hide resolved
@shoyer
Copy link
Member Author

shoyer commented Jul 3, 2024

I guess my concern is that

xr.open_dataset(..., group="/a")

results in a dataset that is different from either of

xr.open_datatree(...)["/a"].ds
xr.open_datatree(...)["/a"].to_dataset()

We are going to have a similar issue when we add group to open_datatree, too: open_datatree(..., group="/a") will be different from open_datatree(...)["/a"].

One way to solve this would be to add an inherited argument to open_dataset and open_datatree, but I'm not sure anybody would actually use it. It's probably better to just to document the discrepancy for now.

In practice, setting the group argument to either open_dataset or open_datatree is something I would expect users to will only do for very large files, in performance critical cases. For these power users, I'm OK with a few more surprising edge cases

@TomNicholas
Copy link
Contributor

TomNicholas commented Jul 3, 2024

@keewis that's a good point, though I'm not sure what we can do about it other than document the subtlety clearly.

Note that the proposed

xr.open_as_dict_of_datasets(...)["/a"]

in #9137 would behave the way you are expecting, i.e. the same as

xr.open_dataset(..., group="/a")

@shoyer shoyer merged commit a86c3ff into pydata:main Jul 3, 2024
28 checks passed
DataTree integration automation moved this from In progress to Done Jul 3, 2024
@shoyer
Copy link
Member Author

shoyer commented Jul 3, 2024

Thanks everyone for your reviews and feedback, especially @TomNicholas ! This definitely improved the final design.

I'm merging this for now so we can start testing and iterating upon it.

@shoyer shoyer deleted the datatree-hierarchy branch July 7, 2024 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Development

Successfully merging this pull request may close these issues.

None yet

4 participants