Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_zarr removes global attributes in destination dataset #8755

Open
5 tasks done
pnorton-usgs opened this issue Feb 15, 2024 · 12 comments
Open
5 tasks done

to_zarr removes global attributes in destination dataset #8755

pnorton-usgs opened this issue Feb 15, 2024 · 12 comments
Labels
bug topic-zarr Related to zarr storage library

Comments

@pnorton-usgs
Copy link

What happened?

Adding new variables to a zarr dataset with to_zarr() always removes the existing global attributes. New global attributes in the source dataset are not always added to the destination dataset depending on how to_zarr() is called.

What did you expect to happen?

I would expect that existing global attributes would always be preserved. If there are new global attributes I would expect them to be added to the existing global attributes instead of replacing all existing global attributes.

Minimal Complete Verifiable Example

import xarray as xr
from pyproj import CRS

local_zarr = 'sample.zarr'

ds_sample = xr.tutorial.load_dataset("air_temperature")

# Make a local copy
ds_sample.to_zarr(local_zarr, mode='w')
ds_sample = xr.open_dataset(local_zarr, engine='zarr', 
                            backend_kwargs={'consolidated':True}, chunks={}, 
                            decode_coords=True)

# Create CRS metadata
crs_meta = CRS.from_epsg(4326).to_cf()

ds_new = xr.Dataset(data_vars={"crs": ([], 1, crs_meta)})
ds_new.attrs['note'] = 'please add this'

# Add all variables from ds_new to the zarr
# NOTE: This adds the new global attribute but also removes
#       all existing global attributes
ds_new.to_zarr(local_zarr, mode='a')

# Add selected variable(s) to zarr dataset
# NOTE: This does not copy new global attributes 
#       and removes all existing global attributes
# ds_new['crs'].to_zarr(local_zarr, mode='a')

# Re-open local zarr store
ds_sample = xr.open_dataset(local_zarr, engine='zarr', 
                            backend_kwargs={'consolidated':True}, chunks={}, 
                            decode_coords=True)

ds_sample

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.0
pydap: installed
h5netcdf: 1.3.0
h5py: 3.8.0
Nio: None
zarr: 2.17.0
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2024.2.0
distributed: 2024.2.0
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.12.2
cupy: None
pint: 0.23
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: 8.0.0
mypy: None
IPython: 8.21.0
sphinx: None

@pnorton-usgs pnorton-usgs added bug needs triage Issue that has not been reviewed by xarray team member labels Feb 15, 2024
Copy link

welcome bot commented Feb 15, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@dcherian dcherian added topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Feb 15, 2024
@NickleDave
Copy link

I'm not a maintainer, but ... not sure if this is a bug:
https://docs.xarray.dev/en/stable/getting-started-guide/faq.html#what-is-your-approach-to-metadata

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option, and there is a global flag, accessible with xarray.set_options(), for setting this to be always True or False). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.

Sorry if I'm adding noise to the issue, just commenting because I'm looking now at using to_zarr with a DataArray that will have attrs

@elyall
Copy link

elyall commented Apr 24, 2024

I believe this is a bug. The Modifying existing Zarr stores section of the documentation outlines ways to modify existing zarr stores and to limit what data is modified using mode, append_dim, and region. And yet dataset attrs is always overwritten which is unintuitive and is in fact propagating the local attrs to the saved file in an undesirable fashion. With mode='a', and I'm guessing also with mode='a-':

  • xarray.Dataset.to_zarr() overwrites the saved dataset attrs with its own, even when it has none.
  • xarray.DataArray.to_zarr() overwrites the saved dataset attrs with its own, which is {} since it's a DataArray and doesn't have Dataset attrs.

@dcherian
Copy link
Contributor

What's the desired behaviour here? Never update zarr.Group.attrs when appending?

@elyall
Copy link

elyall commented Apr 24, 2024

I'm not knowledgeable of the zarr API, but my desired behavior when appending is to not update the root (Dataset) attrs. @pnorton-usgs's example shows a use case where it seems they would like to append to and overwrite conflicts of root attrs, though this would require more processing and might not be desirable for use cases with large metadata.

However this is separate from dealing with variable (DataArray) attrs. When appending a new variable using mode='a' it is desired to have the added variables' attrs get added, as is the current case. Whereas when appending to or modifying an existing variable with mode='a-' and some combination of append_dim and region it is likely desirable to not touch the variables' attrs, though I'm not familiar with the status quo here.

@elyall
Copy link

elyall commented Apr 25, 2024

Looking briefly over the code it seems like we'd just need to change

if self._mode not in ["r", "r+"]:
self.set_attributes(attributes)

to include mode not in ["a", "a-"].

Variable attrs seems to be written here, which we can leave untouched:

zarr_array = _put_attrs(zarr_array, encoded_attrs)

@dcherian
Copy link
Contributor

dcherian commented Apr 25, 2024

I think not updating Group attrs when appending sounds right. It is consistent with the Array writing behaviour (it seems like).

Handling conflicts seems a bit too much complexity.

cc @rabernat @slevang @jhamman for opinions

@slevang
Copy link
Contributor

slevang commented Apr 27, 2024

With mode="a", I'm not sure this does match the array behavior. If you don't set append_dim, then array values are happily overwritten. So it seems to me like overwriting dataset attrs is at least sort of consistent? With mode="a" / append_dim, then the array op is more like an outer join, so in that case you might expect some conflict handling.

Personally I find mode="a" pretty confusing overall and try to avoid it whenever possible.

@elyall
Copy link

elyall commented Apr 29, 2024

@slevang the difference is with mode="a" only included arrays are overwritten, while xarray.Dataset attrs (i.e. root attrs) are always overwritten, even when empty.

Personally I find mode="a" pretty confusing

For an xarray.Dataset with multiple variables/arrays mode="a" is very useful. For instance when you have an analysis pipeline producing different independent variables/arrays at different times, or if you want to modify one step in the pipeline and then recompute and save just that output, it allows you to append/modify whole arrays easily.

When the file only has one variable/array, such as after performing xarray.DataArray.to_zarr(), mode="a" is essentially the same as mode="w".

@slevang
Copy link
Contributor

slevang commented Apr 29, 2024

Yeah I think that's a reasonable distinction.

Totally agree mode="a" is very useful, my point is just that we try to push a lot of different (and potentially surprising) functionality into it. You can append along a dimension, append new variables, overwrite existing variables, etc. You can use it with append_dim or region. Then this issue raises the question of attribute handling, and array vs dataset attrs. This is all a lot of stuff packed into one option for new (or experienced!) users to figure out.

@rabernat
Copy link
Contributor

Agree 💯 @slevang.

Seems like what is needed is a more nuanced Zarr insert / update / upsert / merge utility with a richer syntax.

@slevang
Copy link
Contributor

slevang commented Apr 29, 2024

Your mention of merge gets me thinking in terms of the standard xarray ops every knows well. mode="a" basically covers what xr.merge does, and between compat, join, and combine_attrs you have great power to enumerate all the potential combination options. mode="r+" is like direct array assignment ds[x][:10, :] = .... The trick with moving the API in this direction is that a lot of these type of comparison ops have significant overhead when the data is stored on disk so we want to preserve "fast path" optimizations more conducive to i/o.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-zarr Related to zarr storage library
Projects
None yet
Development

No branches or pull requests

6 participants