Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netCDF IO crash #9125

Open
5 tasks done
ernimd opened this issue Jun 14, 2024 · 1 comment
Open
5 tasks done

netCDF IO crash #9125

ernimd opened this issue Jun 14, 2024 · 1 comment
Labels
bug needs triage Issue that has not been reviewed by xarray team member

Comments

@ernimd
Copy link

ernimd commented Jun 14, 2024

What happened?

I have a use case of computing a function in parallel with dask and inside the function I am opening/reading (no writing) the same .nc dataset in all those parallel calls to do computation based on it. I am getting a bunch (see the log) of different errors (or none) while running it. Most likely it's not designed to be used in this way, or it's an actual bug in either IO implementation or dask itself.

Due to non-deterministic nature you should run the MVE with while python crash.py; do :; done

What did you expect to happen?

I would expect it to fail gracefully, maybe warning about dangerous multi-thread reads.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
from dask import delayed, compute

# create dummy.nc
dummy_data = xr.Dataset(
    {
        "var": (("x", "y"), np.random.rand(10, 10)),
    }
)
dummy_data = dummy_data.chunk({"x": 5, "y": 5}) # not necessary for reproducing...
dummy_data.to_netcdf("dummy.nc")
del dummy_data

def add_columns(value):
    ds = xr.open_dataset(
        "dummy.nc"
    )
    value *= np.random.choice(ds["var"].values.reshape(-1))
    value *= np.random.choice(ds["var"].values.reshape(-1))
    return value

res = []
for _ in range(10):
    res.append(delayed(add_columns)(np.random.rand(1)))
the_ans = compute(*res)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Error log 1

python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:426: nc4_find_nc_grp_h5: Assertion `my_h5 && my_h5->root_grp' failed.
Aborted (core dumped)

Error log 2

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 1327 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1749 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 2:
  #000: H5G.c line 548 in H5Gget_create_plist(): not a group ID
    major: Invalid arguments to routine
    minor: Inappropriate type
Segmentation fault (core dumped)

Error log 3

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1446, in _construct_dataarray
    variable = self._variables[name]
KeyError: 'var'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1545, in __getitem__
    return self._construct_dataarray(key)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1448, in _construct_dataarray
    _, name, variable = _get_virtual_variable(self._variables, name, self.sizes)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 207, in _get_virtual_variable
    raise KeyError(key)
KeyError: 'var'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ernie/code/corres/crash_netcdf5.py", line 29, in <module>
    the_ans = compute(*res)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ernie/code/corres/crash_netcdf5.py", line 21, in add_columns
    value *= np.random.choice(ds["var"].values.reshape(-1))
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1547, in __getitem__
    raise KeyError(
KeyError: "No variable named 'var'. Variables on the dataset include []"


Error log 4

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 635, in close_on_error
    yield
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 660, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/store.py", line 43, in open_dataset
    vars, attrs = filename_or_obj.load()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/common.py", line 251, in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 462, in get_variables
    return FrozenDict(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 443, in FrozenDict
    return Frozen(dict(*args, **kwargs))
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 463, in <genexpr>
    (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 424, in open_store_variable
    attributes = {k: var.getncattr(k) for k in var.ncattrs()}
  File "src/netCDF4/_netCDF4.pyx", line 4585, in netCDF4._netCDF4.Variable.ncattrs
  File "src/netCDF4/_netCDF4.pyx", line 1521, in netCDF4._netCDF4._get_att_names
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
AttributeError: NetCDF: Not a valid ID

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ernie/code/corres/crash_netcdf5.py", line 29, in <module>
    the_ans = compute(*res)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ernie/code/corres/crash_netcdf5.py", line 17, in add_columns
    ds = xr.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/api.py", line 573, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    with close_on_error(store):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 637, in close_on_error
    f.close()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 578, in close
    self._manager.close(**kwargs)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/file_manager.py", line 234, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2627, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2590, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
Nio: None
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.2
distributed: 2024.4.2
matplotlib: 3.8.4
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: 0.9.7
numpy_groupies: 0.11.1
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.22.2
sphinx: None

@ernimd ernimd added bug needs triage Issue that has not been reviewed by xarray team member labels Jun 14, 2024
@keewis
Copy link
Collaborator

keewis commented Jun 19, 2024

This also appears with open_mfdataset, but appears to only happen with a threading scheduler.

In any case, the reason is that netcdf4 is not thread-safe anymore since netcdf4=1.6.1 (I think). Which means this has unfortunately been around for a while and is known (see e.g. #7079), but nobody had the time / skills / persistence to figure out what exactly causes this – race conditions are just that tricky to debug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs triage Issue that has not been reviewed by xarray team member
Projects
None yet
Development

No branches or pull requests

2 participants