
Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint #9243

Merged: 33 commits into pydata:main on Aug 14, 2024

Conversation

eni-awowale (Collaborator) commented Jul 13, 2024:

@TomNicholas, @shoyer, @owenlittlejohns, and @flamingbear

@eni-awowale eni-awowale marked this pull request as draft July 13, 2024 22:03
@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jul 18, 2024
eni-awowale (Collaborator, Author) commented Jul 26, 2024:

I know we touched on this a bit during our meeting on Tuesday, but I think tests for open_groups in netCDF4_.py and h5netcdf_.py should probably go in test_backends_datatree.py. We mentioned we could create separate test files for them, but test_backends_datatree.py already has a stub class for each of these backends, and the new tests would fit nicely there.

test_backends_datatree.py

@requires_netCDF4
class TestNetCDF4DatatreeIO(DatatreeIOBase):
    engine: T_DataTreeNetcdfEngine | None = "netcdf4"


@requires_h5netcdf
class TestH5NetCDFDatatreeIO(DatatreeIOBase):
    engine: T_DataTreeNetcdfEngine | None = "h5netcdf"

Let me know if I have that right and what you all think.
@TomNicholas, @shoyer, @keewis, @flamingbear, @owenlittlejohns

EDIT:
I think we could also add a couple more tests for open_datatree for those specific engines to those classes.

EDIT 2:
If it's okay with everyone I think we should also include a small netCDF file in test/data that has groups. example_1.nc and bears.nc don't have groups.
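For illustration, a tiny grouped file could be written directly with the netCDF4 library. A minimal sketch (the file, group, and variable names here are only examples, not the actual fixture):

import netCDF4

# Sketch: write a small netCDF4 file containing a nested group,
# small enough to sit alongside the existing files in test/data.
with netCDF4.Dataset("example_groups.nc", mode="w") as root:
    root.createDimension("x", 3)
    root_var = root.createVariable("root_var", "f4", ("x",))
    root_var[:] = [1.0, 2.0, 3.0]

    grp = root.createGroup("group1")
    grp.createDimension("y", 2)
    grp_var = grp.createVariable("group_var", "i4", ("y",))
    grp_var[:] = [10, 20]

    sub = grp.createGroup("subgroup")
    sub.createDimension("z", 1)
    sub_var = sub.createVariable("sub_var", "f8", ("z",))
    sub_var[:] = [0.5]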

TomNicholas (Contributor) commented:
Sounds good @eni-awowale

You may even be able to use basically the same groups test for multiple backends by adding it to DataTreeIOBase...
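For example, a shared test might look roughly like this (a sketch only: it assumes a tmp_path-style fixture, that DataTree round-trips via to_netcdf, and that the open_groups entry point added here is reachable as xr.open_groups; the test name is made up):

import xarray as xr
from xarray.core.datatree import DataTree


class DatatreeIOBase:
    engine: T_DataTreeNetcdfEngine | None = None

    # Hypothetical shared test: each backend subclass only sets `engine`,
    # so one round-trip check covers both netcdf4 and h5netcdf.
    def test_open_groups(self, tmp_path):
        filepath = tmp_path / "test.nc"
        original_dt = DataTree.from_dict(
            {
                "/": xr.Dataset({"a": ("x", [1, 2])}),
                "/group1": xr.Dataset({"b": ("y", [3.0])}),
            }
        )
        original_dt.to_netcdf(filepath, engine=self.engine)
        groups = xr.open_groups(filepath, engine=self.engine)
        assert set(groups) == {"/", "/group1"}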

If it's okay with everyone I think we should also include a small netCDF file in test/data that has groups.

Having an included file to test against is a good idea, we just want it to be small.

example_1.nc and bears.nc don't have groups.

Where are these currently? We don't want to add files to the main repo - preferring to put them here https://github.com/pydata/xarray-data

eni-awowale (Collaborator, Author) commented:

You may even be able to use basically the same groups test for multiple backends by adding it to DataTreeIOBase...

So, instead of adding the same type of tests to TestNetCDF4DatatreeIO and TestH5NetCDFDatatreeIO, I can add them directly to DataTreeIOBase if they are not engine-specific? Okay, that makes sense!

@TomNicholas example_1.nc and bears.nc are here test/data.

We don't want to add files to the main repo - preferring to put them here

Okay, great, there might already be some sample files in there with groups that I can use!

TomNicholas (Contributor) commented:

@TomNicholas example_1.nc and bears.nc are here test/data.

I didn't even know those existed... @max-sixty I know you have thought about example datasets for testing purposes - do you have an opinion on whether new files for testing should go in that directory or the separate repository?

shoyer (Member) commented Jul 26, 2024:

@TomNicholas example_1.nc and bears.nc are here test/data.

I didn't even know those existed... @max-sixty I know you have thought about example datasets for testing purposes - do you have an opinion on whether new files for testing should go in that directory or the separate repository?

We want tests to be runnable without a network connection. If we need new files for testing, which I agree would be a good idea in this case, please add them here. Just keep them as small as possible (the current test files are all under 10 KB in size, most under 1 KB).

shoyer (Member) commented Jul 26, 2024:

Where are these currently? We don't want to add files to the main repo - preferring to put them here https://github.com/pydata/xarray-data

xarray-data is for sample/tutorial datasets, which can be much larger (enough data to make an interesting plot) than what we use for test data.

@eni-awowale eni-awowale marked this pull request as ready for review July 30, 2024 14:30
Comment on lines +472 to +476
invalid_netcdf=None,
phony_dims=None,
decode_vlen_strings=True,
driver=None,
driver_kwds=None,
Contributor: These shouldn't be here, right? They should all fall under `**kwargs`.

Contributor: Or maybe they should be in the specific backend but not in common.py?

Collaborator Author: Yeah, these were added from PR #9199, which added back the backend-specific keyword arguments; I pulled that into my branch after it was merged to main. They are not listed individually in common.py, where they are consolidated as **kwargs.

@@ -466,19 +494,23 @@ def open_datatree(
driver=driver,
driver_kwds=driver_kwds,
)
# Check for a group and make it a parent if it exists
if group:
parent = NodePath("/") / NodePath(group)
Contributor: @eni-awowale this is how you should join paths.
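For context, NodePath is xarray's thin wrapper around pathlib.PurePosixPath, so the / operator joins path segments. A small standalone sketch of the same joining behaviour (group name made up):

from pathlib import PurePosixPath

# The `/` operator appends segments, which is what NodePath inherits:
parent = PurePosixPath("/") / PurePosixPath("group1/subgroup")
print(parent)  # /group1/subgroup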

@@ -837,6 +837,43 @@ def open_datatree(
return backend.open_datatree(filename_or_obj, **kwargs)
Contributor: We could have a default implementation here that calls open_groups, i.e.

Suggested change
return backend.open_datatree(filename_or_obj, **kwargs)
groups_dict = backend.open_groups(filename_or_obj, **kwargs)
return DataTree.from_dict(groups_dict)

The idea being that then backend developers don't actually have to implement open_datatree if they have implemented open_groups...
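A sketch of what that fallback could look like on the entrypoint base class (method and class names follow the discussion above, not a final API):

from xarray.core.datatree import DataTree


class BackendEntrypoint:
    def open_groups_as_dict(self, filename_or_obj, **kwargs):
        # Backends override this to return {"/path/to/group": Dataset, ...}.
        raise NotImplementedError

    def open_datatree(self, filename_or_obj, **kwargs):
        # Default implementation: a backend that implements open_groups
        # gets open_datatree for free by assembling the dict into a tree.
        groups_dict = self.open_groups_as_dict(filename_or_obj, **kwargs)
        return DataTree.from_dict(groups_dict)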

This was sort of discussed here (@keewis) #7437 (comment), but this seems like a rabbit hole that should be left for a future PR.

Collaborator: Not really, I was arguing that having any one of open_dataarray, open_dataset, and open_datatree allows us to provide (somewhat inefficient) default implementations for the others. However, open_groups has a much closer relationship to open_datatree, so I think having a default implementation for open_datatree is fine (we just need to make sure that a backend providing neither open_groups nor open_datatree doesn't complain about open_groups not existing if you call open_datatree).

So yeah, this might become a rabbit hole.

Contributor: Okay, I see. That seems related, but also like a totally optional convenience feature that we should defer to later.

eni-awowale (Collaborator, Author) commented:

Hi folks! I think this PR is a couple of commits away from merging, but there are a couple of things to sort out.

mypy checks failing

The open_groups function returns a dict[str, Dataset]. However, the dictionary argument of DataTree.from_dict() is typed as MutableMapping[str, Dataset | DataArray | DataTree | None], and we call DataTree.from_dict() in our open_datatree() backend functions, so mypy flags this as an incompatible type. See below.
"from_dict" of "DataTree" has incompatible type "dict[str, Dataset]"; expected "MutableMapping[str, Dataset | DataArray | DataTree[Any] | None]" [arg-type]
I am not sure exactly how to proceed, since updating the type annotation of DataTree.from_dict() causes other mypy checks to fail because they are expecting the MutableMapping type.

Some checks are failing but seem unrelated to the PR changes

I noticed there are tests failing, but these seem to be consistent with the checks failing on main and are related to test_computations.py. The other tests I see failing are the ubuntu 3.12 flaky tests. Are these failures we can ignore?

Updating _iter_nc_groups

I added the suggestion from @keewis so _iter_nc_groups now returns the parent node, and we no longer have to open the store twice 🎉
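Roughly the shape of that iteration (a sketch only, not the merged code): yielding the parent path first means the caller can populate every group from a single open store.

def iter_nc_groups(root, parent="/"):
    # Yield this group's path, then walk child groups depth-first.
    yield parent
    for name, group in root.groups.items():
        path = f"{parent.rstrip('/')}/{name}"
        yield from iter_nc_groups(group, parent=path)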

cc
@TomNicholas, @keewis, @shoyer, @flamingbear

keewis (Collaborator) commented Aug 8, 2024:

Are these failures we can ignore?

flaky marks tests that may fail at random, mostly because of server or network issues. Usually they will pass again on the next run (I reran it for you, let's see what happens). The other test failures are indeed also on main, so we can ignore them here.

Edit: flaky passes now, so I believe you can ignore the remaining test failures

I am not sure exactly how to proceed with this, since updating the type annotation of DataTree.from_dict() causes other mypy checks to fail because they are expecting the MutableMapping type.

Maybe the typing crowd can help? cc @max-sixty, @Illviljan, @headtr1ck

Illviljan (Contributor) commented:

mypy checks failing

The open_groups function returns a dict[str, Dataset]. However, the dictionary argument of DataTree.from_dict() is typed as MutableMapping[str, Dataset | DataArray | DataTree | None], and we call DataTree.from_dict() in our open_datatree() backend functions, so mypy flags this as an incompatible type: "from_dict" of "DataTree" has incompatible type "dict[str, Dataset]"; expected "MutableMapping[str, Dataset | DataArray | DataTree[Any] | None]" [arg-type]. I am not sure exactly how to proceed, since updating the type annotation of DataTree.from_dict() causes other mypy checks to fail because they are expecting the MutableMapping type.

It's this type of problem: https://stackoverflow.com/questions/73603289/why-doesnt-parameter-type-dictstr-unionstr-int-accept-value-of-type-di
from_dict should probably switch from MutableMapping to Mapping instead.
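The variance problem in miniature (a self-contained sketch, not xarray code):

from collections.abc import Mapping, MutableMapping

def as_mutable(d: MutableMapping[str, int | None]) -> None:
    d["extra"] = None  # legal for the declared type, so mypy must be strict

def as_readonly(d: Mapping[str, int | None]) -> None:
    ...  # read-only, so widening the value type cannot corrupt the dict

groups: dict[str, int] = {"a": 1}
as_mutable(groups)   # mypy error: could let None sneak into a dict[str, int]
as_readonly(groups)  # OK: Mapping is covariant in its value type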

Comment on lines 1099 to 1100
d_cast = cast(dict, d)
root_data = d_cast.pop("/", None)
Illviljan (Contributor) commented Aug 13, 2024:

Mypy is correct here: a Mapping does not include .pop, and ignoring the typing errors doesn't solve the bug.

xarray uses Mappings frequently, for example xr.core.utils.FrozenDict(dict(a=3)).pop("a", None), so it's a real issue.

Either explicitly convert it to a dict, i.e. d_cast = dict(d), or refactor the code to not use .pop, since I'm not so sure it's needed.
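In sketch form: cast() only silences mypy and changes nothing at runtime, whereas dict() builds a real dict that actually has .pop:

from collections.abc import Mapping
from typing import cast

def pop_root_cast(d: Mapping[str, object]) -> object:
    d_cast = cast(dict, d)        # type checker appeased, but if d is a
    return d_cast.pop("/", None)  # FrozenDict this still fails at runtime

def pop_root_copy(d: Mapping[str, object]) -> object:
    d_copy = dict(d)              # real dict copy with the full dict API
    return d_copy.pop("/", None)  # safe; the caller's mapping is untouched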

Collaborator Author: Thanks, I updated it to explicitly convert the type to a dict. The mypy 3.9 tests are still passing, but the CI mypy check seems to return the same error as before the explicit conversion.

eni-awowale (Collaborator, Author) commented:

@TomNicholas and @keewis, the issue we just moved over, #9336, seems to be related to the latest batch of test failures.

eni-awowale (Collaborator, Author) commented:

Yay, all checks are passing! Does someone want to give this a quick look before merging?
cc
@TomNicholas, @keewis, @owenlittlejohns, @shoyer and @flamingbear

def open_groups_as_dict(
self,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
**kwargs: Any,
Collaborator (headtr1ck): Maybe we should not make the same mistake as with open_dataset, and should prevent Liskov errors.

Suggested change
**kwargs: Any,

If the abstract method supports any kwargs, so must all subclass implementations, which is not what we want.
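A sketch of that substitutability problem, with made-up class names:

from typing import Any

class Base:
    def open_groups_as_dict(self, filename_or_obj, **kwargs: Any): ...

class Backend(Base):
    # mypy flags this override: the base class promises to accept any
    # keyword argument, so dropping **kwargs narrows the signature and
    # violates the Liskov substitution principle.
    def open_groups_as_dict(self, filename_or_obj, *, group: str | None = None): ...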

Collaborator Author: @headtr1ck, from my understanding the **kwargs were added back to fix issue #9135.

Collaborator (headtr1ck): Hmmm, not sure. But since this is the same problem in all the other backend methods, I'm fine with leaving it as it is (and possibly changing it in a future PR altogether).

Collaborator Author: Sounds good, we can revisit this in another PR.

decode_vlen_strings=True,
driver=None,
driver_kwds=None,
**kwargs,
Collaborator: This should be obsolete as well once you remove it from the abstract method.

persist=False,
lock=None,
autoclose=False,
**kwargs,
Collaborator: Same here.

@eni-awowale eni-awowale merged commit 3c19231 into pydata:main Aug 14, 2024
28 checks passed
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 17, 2024
* main:
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
dcherian added a commit to TomNicholas/xarray that referenced this pull request Aug 22, 2024
* main: (214 commits)
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  ...
Labels: API design, io, topic-backends, topic-DataTree (Related to the implementation of a DataTree class)
Successfully merging this pull request may close these issues.

open_dict_of_datasets function to open any file containing nested groups
8 participants