Fix zarr append dtype checks #6476

Merged
merged 10 commits into from
May 11, 2022

@cisaacstern cisaacstern commented Apr 12, 2022

This WIP should close #6345 when complete. I'll mark as Ready for review along with an explanatory comment when it's ready.

cisaacstern commented Apr 13, 2022

Here's a first pass at a solution for #6345. This is a new area for me, so I certainly look forward to feedback from those with more experience.

The issue identified in #6345 was lack of support for append-mode Zarr writes for variables with dtype "|S*" where * is a positive integer. IIUC (but I may not), this datatype represents fixed-length strings, where * is the length in characters.
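
For reference, here is a small NumPy sketch of the two fixed-width string dtypes discussed in this thread. One detail worth noting: |S* is NumPy's fixed-width bytes dtype, so * counts bytes, while <U* is fixed-width unicode, where * counts characters. The variable names are illustrative only.

```python
import numpy as np

# Fixed-width bytes: 3 bytes per element.
bytes_dtype = np.dtype("|S3")
# Fixed-width unicode: 3 characters per element, stored as 4 bytes each.
unicode_dtype = np.dtype("<U3")

print(bytes_dtype.kind, bytes_dtype.itemsize)      # S 3
print(unicode_dtype.kind, unicode_dtype.itemsize)  # U 12

# The width is part of the dtype, so different widths compare unequal.
print(np.dtype("|S3") == np.dtype("|S2"))  # False
print(np.dtype("<U3") == np.dtype("<U3"))  # True
```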

The MRE at the top of the linked issue demonstrated this problem for the case in which a new variable is being added, to which @shoyer replied in #6345 (comment):

I think the original issue was that appending a fixed-width string could be a problem if the fixed-width does not match the width of the existing string dtype stored in Zarr. ... This obviously doesn't apply in this case, because you are adding an entirely new variable. So I guess the check could be removed in that case.

In this comment, Stephan also links to the motivating case for the datatype check, which was to prevent truncation of strings of a greater maximum length (e.g. <U5) when appended to an existing array of strings of a smaller maximum length (e.g. <U2).
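
The truncation risk can be illustrated with plain NumPy, without involving Zarr at all. This is only a sketch of the underlying behavior (forcing wider strings into a narrower fixed-width dtype silently drops characters), not a reproduction of the original bug report; the array names are made up for the example.

```python
import numpy as np

# Stand-in for data already stored with a narrow fixed-width dtype.
existing = np.array(["ab", "cd"], dtype="<U2")
# Stand-in for the data a user wants to append, with a wider dtype.
incoming = np.array(["hello"], dtype="<U5")

# Casting to the existing dtype truncates without raising an error.
truncated = incoming.astype(existing.dtype)
print(truncated)  # ['he']

# Item assignment into the narrow array truncates in the same way.
existing[0] = "hello"
print(existing[0])  # he
```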

In sum, it seems to me that the requirements for a fix to these datatype checks are as follows:

  1. Support append of new variables of dtype |S* and <U* regardless of their length value (represented by * here)
  2. Support append to existing variables of dtype |S* and <U*, if dtype of the variable to append matches existing dtype exactly (i.e. allow appending |S2 to |S2, <U3 to <U3, etc.)
  3. Raise an exception only when the user attempts to append length-specified string data (of type |S* or <U*) to an existing array of data with a different datatype (i.e. appending |S3 to |S2, <U5 to <U2, etc. is not allowed)
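
The three requirements above can be sketched as a small standalone check. This is not the code from this PR: `check_append_dtype` and `is_easy_to_append` are hypothetical names, the plain dict stands in for the variables of the existing store, and the set of easy-to-append dtypes (numeric, datetime64, bool, object) is an assumption made for illustration.

```python
import numpy as np


def is_easy_to_append(dtype):
    """Hypothetical helper: dtypes assumed safe to append without an exact match."""
    return bool(
        np.issubdtype(dtype, np.number)
        or np.issubdtype(dtype, np.datetime64)
        or np.issubdtype(dtype, np.bool_)
        or dtype == object
    )


def check_append_dtype(name, new_dtype, existing_dtypes):
    """Raise ValueError if appending `new_dtype` to variable `name` is unsafe.

    `existing_dtypes` maps variable names already in the store to their dtypes
    (a plain dict here, standing in for the real store).
    """
    new_dtype = np.dtype(new_dtype)
    if name not in existing_dtypes:
        # Requirement 1: a brand-new variable cannot conflict with anything.
        return
    if is_easy_to_append(new_dtype):
        return
    existing = np.dtype(existing_dtypes[name])
    if new_dtype != existing:
        # Requirement 3: fixed-width strings must match the stored dtype exactly.
        raise ValueError(
            f"Mismatched dtypes for variable {name!r}: "
            f"existing {existing}, new {new_dtype}"
        )
    # Requirement 2: exact match, so the append is allowed.


store = {"s": np.dtype("|S2"), "u": np.dtype("<U2")}
check_append_dtype("brand_new", "|S7", store)  # new variable: allowed
check_append_dtype("s", "|S2", store)          # exact match: allowed
try:
    check_append_dtype("u", "<U5", store)      # width mismatch: rejected
except ValueError as err:
    print(err)
```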

This PR accomplishes this by leaving the initial checks largely intact, with the following adjustments:

| Existing checks | This PR |
| --- | --- |
| Raise an error in every case if the datatype of the variable to append is not known to easily be appended (e.g. floats, etc.) | If the datatype of the variable to append is not known to be easy to append, only raise an error if its datatype does not exactly match the datatype of the corresponding variable in the existing store |
| Opinionated about datatype regardless of whether the variable is present or not in the existing store | Permissive of all datatypes if the user is adding an entirely new variable (because no potential incompatibility in this case) |
| Assumes variable will be easy to append if coding.strings.is_unicode_dtype evaluates to True | Removes this assumption, because it turns out that xr.coding.strings.is_unicode_dtype(np.dtype("<U5")) evaluates to True, and this is one of the cases where we want to ensure exact length equality. |

I've incorporated coverage for the above-listed requirements into the test suite. The way the datatype validation is now written, any datatype (not just |S* and <U*) that is not known to be easy-to-append will pass the check only if its type matches the datatype of the corresponding variable in the existing store. I'm not aware of what specific other datatypes may fall into this category, but assume that if they are problematic, an error will be raised eventually at the Zarr level.

Thank you in advance to reviewers of my first xarray PR. (Noting also that it looks like the docs build failure is common to other current PRs, and not specific to this PR.)

@cisaacstern marked this pull request as ready for review April 13, 2022 00:08
@TomNicholas added the topic-zarr (Related to zarr storage library) and needs review labels Apr 13, 2022
@max-sixty
Collaborator

Hi @cisaacstern — thanks a lot and welcome to xarray!

This looks very coherent, as far as the context I have. Any thoughts from others who know the area better?

@max-sixty added the plan to merge (Final call for comments) label Apr 15, 2022
@shoyer
Member

shoyer commented Apr 15, 2022 via email

@dcherian removed the plan to merge (Final call for comments) label Apr 19, 2022

cisaacstern commented May 10, 2022

@shoyer just wanted to chime in with a bump to say that I'll greatly appreciate your review of this PR when you get a moment.

This fix is currently the only blocker for pangeo-forge/staged-recipes#120. (I've just confirmed today that installing xarray from this PR branch resolves the error there, as detailed in bullets 2-3 of pangeo-forge/staged-recipes#120 (comment).)

I'm sure you're quite busy, so just want to emphasize how much I appreciate your attention to this, whenever you get a chance.

@shoyer shoyer left a comment

Thanks Charles for your contribution here, your gentle reminder and your patience :)

This looks great to me. My only suggestion is to summarize some of the insights from your comments here into code, where they will be easier to find for future readers/editors.

):
# and not re.match('^bytes[1-9]+$', var.dtype.name)):
pass
Member

Could you add a brief comment (e.g., based on your comment here: #6476 (comment)) to summarize why it's OK not to check these cases?

Contributor Author

Thanks so much for the review!

I've added the requested comment in 9f97f00

@dcherian
Contributor

Thanks @cisaacstern this is a great first contribution. Hope to see you around again!

@dcherian merged commit 4a53e41 into pydata:main May 11, 2022
@cisaacstern
Contributor Author

Thanks all for your mentorship on this! Excited to continue contributing.

dcherian added a commit to dcherian/xarray that referenced this pull request May 20, 2022
* main: (24 commits)
  Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
  Enable flox in GroupBy and resample (pydata#5734)
  Add setuptools as dependency in ASV benchmark CI (pydata#6609)
  change polyval dim ordering (pydata#6601)
  re-add timedelta support for polyval (pydata#6599)
  Minor Dataset.map docstr clarification (pydata#6595)
  New inline_array kwarg for open_dataset (pydata#6566)
  Fix polyval overloads (pydata#6593)
  Restore old MultiIndex dropping behaviour (pydata#6592)
  [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
  Fix zarr append dtype checks (pydata#6476)
  Add missing space in exception message (pydata#6590)
  Doc Link to accessors list in extending-xarray.rst (pydata#6587)
  Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
  Add some warnings about rechunking to the docs (pydata#6569)
  [pre-commit.ci] pre-commit autoupdate (pydata#6584)
  terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
  Allow string formatting of scalar DataArrays (pydata#5981)
  Fix mypy issues & reenable in tests (pydata#6581)
  polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
  ...
dcherian added a commit to headtr1ck/xarray that referenced this pull request May 20, 2022
commit 398f1b6
Author: dcherian
Date:   Fri May 20 08:47:56 2022 -0600

    Backward compatibility dask

commit bde40e4
Merge: 0783df3 4cae8d0
Author: dcherian
Date:   Fri May 20 07:54:48 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main:
      concatenate docs style (pydata#6621)
      Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
      {full,zeros,ones}_like typing (pydata#6611)

commit 0783df3
Merge: 5cff4f1 8de7061
Author: dcherian
Date:   Sun May 15 21:03:50 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main: (24 commits)
      Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
      Enable flox in GroupBy and resample (pydata#5734)
      Add setuptools as dependency in ASV benchmark CI (pydata#6609)
      change polyval dim ordering (pydata#6601)
      re-add timedelta support for polyval (pydata#6599)
      Minor Dataset.map docstr clarification (pydata#6595)
      New inline_array kwarg for open_dataset (pydata#6566)
      Fix polyval overloads (pydata#6593)
      Restore old MultiIndex dropping behaviour (pydata#6592)
      [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
      Fix zarr append dtype checks (pydata#6476)
      Add missing space in exception message (pydata#6590)
      Doc Link to accessors list in extending-xarray.rst (pydata#6587)
      Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
      Add some warnings about rechunking to the docs (pydata#6569)
      [pre-commit.ci] pre-commit autoupdate (pydata#6584)
      terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
      Allow string formatting of scalar DataArrays (pydata#5981)
      Fix mypy issues & reenable in tests (pydata#6581)
      polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
      ...

commit 5cff4f1
Merge: dfe200d 6144c61
Author: Maximilian Roos
Date:   Sun May 1 15:16:33 2022 -0700

    Merge branch 'main' into dask-datetime-to-numeric

commit dfe200d
Author: dcherian
Date:   Sun May 1 11:04:03 2022 -0600

    Minor cleanup

commit 35ed378
Author: dcherian
Date:   Sun May 1 10:57:36 2022 -0600

    Support dask arrays in datetime_to_numeric
Labels
topic-zarr Related to zarr storage library
Development

Successfully merging this pull request may close these issues.

to_zarr raises ValueError: Invalid dtype with mode='a' (but not with mode='w')
5 participants