Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Issue with decoding dataset datetime that include NaN values (cftime and linux bug) #648

Closed
ezekiel-lemur opened this issue Apr 23, 2024 · 19 comments
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@ezekiel-lemur
Copy link

What happened?

Issue with decoding datetime that include NaN values

What did you expect to happen? Are there are possible answers you came across?

No response

Minimal Complete Verifiable Example (MVCE)

path1 = '/nethome/0641332/Thesis/Data/Parcels/standard6Controlnew_run.nc'
ds = xr.open_dataset(path1, decode_times=False)
ds['time'].attrs['axis'] = 'time'
ds['time'].attrs['long_name'] = 'time'
ds = ds.set_coords(("time"))
ds_decoded = xcdat.decode_time(ds)


the ds file looks like:
<xarray.Dataset>
Dimensions:     (trajectory: 92678, obs: 920)
Coordinates:
  * obs         (obs) int32 0 1 2 3 4 5 6 7 ... 912 913 914 915 916 917 918 919
    time        (trajectory, obs) float64 3.478e+08 3.477e+08 ... nan nan
  * trajectory  (trajectory) int64 146163 146164 146165 ... 59605 60362 65431
Data variables:
    S           (trajectory, obs) float32 ...
    T           (trajectory, obs) float32 ...
    age         (trajectory, obs) float32 ...
    lat         (trajectory, obs) float32 ...
    lon         (trajectory, obs) float32 ...
    transport   (trajectory) float32 ...
    z           (trajectory, obs) float32 ...
Attributes:
    Conventions:            CF-1.6/CF-1.7
    feature_type:           trajectory
    ncei_template_version:  NCEI_NetCDF_Trajectory_Template_v2.0
    parcels_kernels:        SampleParticleAdvectionRK4_3DSample_T_SCheckPreBo...
    parcels_mesh:           spherical
    parcels_version:        3.0.2


and the time axis:
(trajectory, obs)
float64
3.478e+08 3.477e+08 ... nan nan
axis : time
calendar : noleap
long_name : time
standard_name :time
units : seconds since 0488-02-01 00:00:00

Relevant log output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[54], line 1
----> 1 ds_decoded = xcdat.decode_time(ds)

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/dataset.py:370, in decode_time(dataset)
    367 ds = ds.assign_coords({coords.name: decoded_time})
    369 try:
--> 370     bounds = ds.bounds.get_bounds("T", var_key=coords.name)
    371 except KeyError:
    372     bounds = None

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/bounds.py:228, in BoundsAccessor.get_bounds(self, axis, var_key)
    199 def get_bounds(
    200     self, axis: CFAxisKey, var_key: Optional[str] = None
    201 ) -> Union[xr.Dataset, xr.DataArray]:
    202     """Gets coordinate bounds.
    203 
    204     Parameters
   (...)
    226         If bounds were not found for the specific ``axis``.
    227     """
--> 228     self._validate_axis_arg(axis)
    230     if var_key is None:
    231         # Get all bounds keys in the Dataset for this axis.
    232         bounds_keys = self._get_bounds_keys(axis)

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/bounds.py:855, in BoundsAccessor._validate_axis_arg(self, axis)
    850     keys = ", ".join(f"'{key}'" for key in cf_axis_keys)
    851     raise ValueError(
    852         f"Incorrect 'axis' argument value. Supported values include {keys}."
    853     )
--> 855 get_dim_coords(self._dataset, axis)

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/axis.py:132, in get_dim_coords(obj, axis)
    127 index_keys = obj.indexes.keys()
    129 # Attempt to map the axis it all of its coordinate variable(s) using the
    130 # axis and coordinate names in the object attributes (if they are set).
    131 # Example: Returns ["time", "time_centered"] with `axis="T"`
--> 132 coord_keys = _get_all_coord_keys(obj, axis)
    133 # Filter the index keys to just the dimension coordinate keys.
    134 # Example: Returns ["time"], since "time_centered" is not in `index_keys`
    135 dim_coord_keys = list(set(index_keys) & set(coord_keys))

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/axis.py:317, in _get_all_coord_keys(obj, axis)
    314     pass
    316 try:
--> 317     keys = keys + obj.cf.coordinates[cf_attrs["coordinate"]]
    318 except KeyError:
    319     pass

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:1659, in CFAccessor.coordinates(self)
   1642 @property
   1643 def coordinates(self) -> dict[str, list[Hashable]]:
   1644     """
   1645     Mapping valid Coordinate standard names for ``.cf[]`` to variable names.
   1646 
   (...)
   1657         Values are lists of variable names that match that particular key.
   1658     """
-> 1659     vardict = {key: _get_coords(self._obj, key) for key in _COORD_NAMES}
   1661     return {k: sort_maybe_hashable(v) for k, v in vardict.items() if v}

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:505, in _get_coords(obj, key)
    500 def _get_coords(obj: DataArray | Dataset, key: Hashable) -> list[Hashable]:
    501     """
    502     One or more of ('X', 'Y', 'Z', 'T', 'longitude', 'latitude', 'vertical', 'time',
    503     'area', 'volume'), or arbitrary measures, or standard names present in .coords
    504     """
--> 505     return [k for k in _get_all(obj, key) if k in obj.coords or k in obj.dims]

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:480, in _get_all(obj, key)
    467 """
    468 One or more of ('X', 'Y', 'Z', 'T', 'longitude', 'latitude', 'vertical', 'time',
    469 'area', 'volume'), or arbitrary measures, or standard names
    470 """
    471 all_mappers: tuple[Mapper] = (
    472     _get_custom_criteria,
    473     functools.partial(_get_custom_criteria, criteria=cf_role_criteria),  # type: ignore[assignment]
   (...)
    478     _get_with_standard_name,
    479 )
--> 480 results = apply_mapper(all_mappers, obj, key, error=False, default=None)
    481 return results

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:137, in apply_mapper(mappers, obj, key, error, default)
    135 results = []
    136 for mapper in mappers:
--> 137     results.append(_apply_single_mapper(mapper))
    139 flat = list(itertools.chain(*results))
    140 # de-duplicate

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:121, in apply_mapper.<locals>._apply_single_mapper(mapper)
    119 def _apply_single_mapper(mapper):
    120     try:
--> 121         results = mapper(obj, key)
    122     except (KeyError, ValueError) as e:
    123         if error or "I expected only one." in repr(e):

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/cf_xarray/accessor.py:320, in _get_axis_coord(obj, key)
    317     results.update((coord,))
    318 if criterion == "units":
    319     # deal with pint-backed objects
--> 320     units = getattr(var.data, "units", None)
    321     if units in expected:
    322         results.update((coord,))

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xarray/core/dataarray.py:756, in DataArray.data(self)
    744 @property
    745 def data(self) -> Any:
    746     """
    747     The DataArray's data as an array. The underlying array type
    748     (e.g. dask, sparse, pint) is preserved.
   (...)
    754     DataArray.values
    755     """
--> 756     return self.variable.data

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xarray/core/variable.py:416, in Variable.data(self)
    414     return self._data
    415 elif isinstance(self._data, indexing.ExplicitlyIndexed):
--> 416     return self._data.get_duck_array()
    417 else:
    418     return self.values

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xarray/coding/variables.py:74, in _ElementwiseFunctionArray.get_duck_array(self)
     73 def get_duck_array(self):
---> 74     return self.func(self.array.get_duck_array())

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xcdat/dataset.py:649, in _get_cftime_coords(offsets, units, calendar)
    646 units_type, ref_date = units.split(" since ")
    648 if units_type not in NON_CF_TIME_UNITS:
--> 649     return decode_cf_datetime(offsets, units, calendar=calendar, use_cftime=True)
    651 offsets = np.asarray(offsets)
    652 flat_offsets = offsets.ravel()

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xarray/coding/times.py:342, in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    340                 dates = cftime_to_nptime(dates)
    341 elif use_cftime:
--> 342     dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
    343 else:
    344     dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)

File ~/miniconda3/envs/parcels/lib/python3.12/site-packages/xarray/coding/times.py:237, in _decode_datetime_with_cftime(num_dates, units, calendar)
    234     raise ModuleNotFoundError("No module named 'cftime'")
    235 if num_dates.size > 0:
    236     return np.asarray(
--> 237         cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
    238     )
    239 else:
    240     return np.array([], dtype=object)

File src/cftime/_cftime.pyx:630, in cftime._cftime.num2date()

File src/cftime/_cftime.pyx:499, in cftime._cftime.decode_dates_from_array()

TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeNoLeap' and 'NoneType'

Anything else we need to know?

No response

Environment

version('xcdat'): '0.7.0'

INSTALLED VERSIONS

commit: None
python: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:43:22) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-513.24.1.el8_9.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.11.0
pandas: 2.1.4
numpy: 1.26.3
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.5
dask: 2023.12.0
distributed: 2023.12.0
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.20.0
sphinx: None

@ezekiel-lemur ezekiel-lemur added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Apr 23, 2024
@ezekiel-lemur ezekiel-lemur changed the title [Bug]: [Bug]: Issue with decoding dataset datetime that include NaN values Apr 23, 2024
@tomvothecoder
Copy link
Collaborator

Thanks for opening this issue @ezekiel-lemur! We'll take a look at it.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 23, 2024

It looks like the data used here does not have standard or conventional time axis. @ezekiel-lemur would it be possible to share your data so one can try reproduce your error?

@ezekiel-lemur
Copy link
Author

Thank you @lee1043 and @tomvothecoder !
I have uploaded a small subset of the data (rightfully named subset.nc), it breaks with my minimum breaking example too.

subset.nc.zip

@lee1043
Copy link
Collaborator

lee1043 commented Apr 23, 2024

@ezekiel-lemur thanks for the data. I was able to reproduce the error.

I think the issue might be related to the 2-dimensional time coordinate. In most case xcdat presumes dataArray for coordinate or axis to be 1-dimensional (@tomvothecoder please correct me if I am wrong), but your time dimension has 2 dimensions. And in the dataset, time coordinate is not associated with any of data variables (e.g., S, T, age, ...).

It looks like there are different observation points and each of them has different trajectory (not sure trajectory here means time or not). I also noticed that the time coordinate repeated values for different trajectories, which I got confused for the structure of the data. Can you say little bit more about your data? e.g., expected dimensions for each variables?

@tomvothecoder
Copy link
Collaborator

In most case xcdat presumes dataArray for coordinate or axis to be 1-dimensional (@tomvothecoder please correct me if I am wrong)

Thanks for helping debug, Jiwoo! Yes this is correct, xCDAT expects 1-D coordinates.

@pochedls
Copy link
Collaborator

Why can't I reproduce the error (just tried 0.6.1 and 0.7.0)?

# imports
import xcdat as xc

# I/0
fn = 'subset.nc'
ds = xc.open_dataset(fn, decode_times=False)
dsd = xc.decode_time(ds)
dsd.time.values

/opt/homebrew/Caskroom/miniconda/base/envs/xcdat/lib/python3.12/site-packages/xarray/coding/times.py:240: RuntimeWarning: invalid value encountered in cast
cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
Out[46]:
array([[cftime.DatetimeNoLeap(499, 2, 11, 0, 0, 0, 0, has_year_zero=True),
cftime.DatetimeNoLeap(499, 2, 10, 0, 0, 0, 0, has_year_zero=True),
cftime.DatetimeNoLeap(499, 2, 9, 0, 0, 0, 0, has_year_zero=True),
...,

@lee1043
Copy link
Collaborator

lee1043 commented Apr 24, 2024

Why can't I reproduce the error (just tried 0.6.1 and 0.7.0)?

# imports
import xcdat as xc

# I/0
fn = 'subset.nc'
ds = xc.open_dataset(fn, decode_times=False)
dsd = xc.decode_time(ds)
dsd.time.values

@pochedls that's interesting... I still get the same error when running your code. My xcdat version is 0.6.1.

@ezekiel-lemur
Copy link
Author

ezekiel-lemur commented Apr 24, 2024 via email

@pochedls
Copy link
Collaborator

This is working for me on an M2 Mac (updated xarray and xcdat), but not linux. The problem on linux traces to cftime:

cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)

Both Mac and Linux are using cftime 1.6.2.

On Linux the error I get is:

# imports
import xcdat as xc
import cftime

# I/0
fn = 'subset.nc'
ds = xc.open_dataset(fn, decode_times=False)
cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)

<ipython-input-4-abe4db8192af>:1: RuntimeWarning: invalid value encountered in cast
cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [4], line 1
----> 1 cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)
File src/cftime/_cftime.pyx:630, in cftime._cftime.num2date()
File src/cftime/_cftime.pyx:499, in cftime._cftime.decode_dates_from_array()
TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeGregorian' and 'NoneType'

If someone else could verify this on Mac, then I think this might be isolated to a cftime + linux issue.

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Apr 24, 2024

If someone else could verify this on Mac, then I think this might be isolated to a cftime + linux issue.

I created a fresh environment with conda create -n xcdat_070 -c conda-forge xcdat=0.7.0 xarray=2024.3.0 cftime=1.6.3.

I ran both versions of the code here and here on my M2 Mac and both worked, although I get a RuntimeWarning.

xcdat.__version__
'0.7.0'

xarray.__version__
'2024.3.0'  # Also worked with '2024.2.0' and '2023.11.0'

cftime.__version__
'1.6.3'. # Also worked with '1.6.2'
<ipython-input-2-858b1454e938>:5: RuntimeWarning: invalid value encountered in cast
  cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)
array([[cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)],
       ...,
       [cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)],
       [cftime.DatetimeGregorian(499, 2, 8, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 7, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(499, 2, 6, 0, 0, 0, 0, has_year_zero=False),
        ...,
        cftime.DatetimeGregorian(496, 8, 5, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(496, 8, 4, 0, 0, 0, 0, has_year_zero=False),
        cftime.DatetimeGregorian(488, 2, 1, 0, 0, 0, 0, has_year_zero=False)]],
      dtype=object)

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Apr 24, 2024

I ran the same sets of code in my above comment on Linux (RH7) and I get the same error.

I think this confirms it is a cftime + Linux issue. I'll open a ticket over on the cftime repo.

UPDATE: Just opened a ticket on the cftime repo here

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File [/home/vo13/xCDAT/xcdat/648_qa.py:5](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:5)
      [3](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:3) fn = 'subset.nc'
      [4](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:4) ds = xc.open_dataset(fn, decode_times=False)
----> [5](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:5) cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)

File src/cftime/_cftime.pyx:630, in cftime._cftime.num2date()

File src/cftime/_cftime.pyx:499, in cftime._cftime.decode_dates_from_array()

TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeGregorian' and 'NoneType'

@ezekiel-lemur
Copy link
Author

I ran the same sets of code in my above comment on Linux (RH7) and I get the same error.

I think this confirms it is a cftime + Linux issue. I'll open a ticket over on the cftime repo.

UPDATE: Just opened a ticket on the cftime repo here

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File [/home/vo13/xCDAT/xcdat/648_qa.py:5](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:5)
      [3](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:3) fn = 'subset.nc'
      [4](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:4) ds = xc.open_dataset(fn, decode_times=False)
----> [5](https://vscode-remote+ssh-002dremote-002bacme1.vscode-resource.vscode-cdn.net/home/vo13/xCDAT/xcdat/648_qa.py:5) cftime.num2date(ds.time.values, ds.time.units, 'standard', only_use_cftime_datetimes=True)

File src/cftime/_cftime.pyx:630, in cftime._cftime.num2date()

File src/cftime/_cftime.pyx:499, in cftime._cftime.decode_dates_from_array()

TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeGregorian' and 'NoneType'

Thanks for following up with cftime :)

@tomvothecoder
Copy link
Collaborator

@ezekiel-lemur Of course! Thanks for working with us through this issue. Hopefully it can be resolved soon. The (inconvenient) alternative is to try your code on a non-Linux based machine for now.

@tomvothecoder tomvothecoder changed the title [Bug]: Issue with decoding dataset datetime that include NaN values [Bug]: Issue with decoding dataset datetime that include NaN values (cftime and linux bug) Apr 30, 2024
@tomvothecoder
Copy link
Collaborator

Just an update about why this bug only happens on Linux (comment):

I suspect that the problem is that the operations with NaNs can give compiler-dependent results (hence the difference with gcc (linux) and clang (macosx)). I think the way to avoid this is to either raise an exception if nans or infs are found, or treat them as missing values (as PR #330 does).

@ezekiel-lemur
Copy link
Author

Hey! I see the issue was closed on Nctime, but I am not sure what the solution is for now? Would you be so kind and let me know please :)

@tomvothecoder
Copy link
Collaborator

Hey! I see the issue was closed on Nctime, but I am not sure what the solution is for now? Would you be so kind and let me know please :)

Hi @ezekiel-lemur, here's a quick, temporary, hacky workaround to this issue on Linux.

It involves converting np.nan values to 0s, decoding to cftime.datetime, then converting the 0s (as cftime.datetime) back to np.nan values.

import cftime
import numpy as np
import xarray as xr
import xcdat as xc

# Open the dataset.
filepath = "subset.nc"
ds = xr.open_dataset(filepath, decode_times=False)

# _FillValue is `nan`, so we need to use another value (I choose 0)
print(ds.time.encoding["_FillValue"])

# Make sure no values are 0, which we use to represent missing values.
# If 0's are present, we will inadvertently drop actual time coordinate values
# and instead need to use another fill value.
# The print shows the shape is (100, 920), which is the same as before. 
# We can use 0 to represent missing values.
print(ds.time.where(ds.time != 0, drop=True).shape)

# Convert `np.nan` to 0 to represent missing values.
ds["time"] = ds.time.fillna(0)
ds["time"].attrs["axis"] = "time"
ds["time"].attrs["long_name"] = "time"
ds = ds.set_coords(("time"))

# Decode the time coordinates.
ds_decoded = xc.decode_time(ds)

# Convert the cftime.datetime value 0 back to `np.nan`.
fill_value = cftime.num2date([0], ds.time.units, calendar=ds.time.calendar)
ds_decoded["time"] = ds_decoded.time.where(
    ds_decoded.time != fill_value, np.nan, drop=False
)

Alternatively, you can wait for a new release of cftime that includes PR 330, or use MacOS to run your code (with the caveat being the behavior noted here).

@pochedls
Copy link
Collaborator

Just a note...it's not clear to me that things will work correctly after the cftime PR is merged (I'm not sure how xarray is going to handle masked arrays from cftime).

@tomvothecoder
Copy link
Collaborator

Just a note...it's not clear to me that things will work correctly after the cftime PR is merged (I'm not sure how xarray is going to handle masked arrays from cftime).

I'm also not sure how Xarray handles masked arrays within xr.DataArrays. If this ends up breaking something in xarray, we'll need to open up an issue there. Maybe the masked array can automatically be converted to a numpy array with missing values represented by np.nan (using this code), before the decoded time xr.DataArray is built. We can implement this logic ourselves in xCDAT's _get_cf_time_coords().

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Jun 21, 2024

FYI @ezekiel-lemur. I will close this issue now.

Maybe the masked array can automatically be converted to a numpy array with missing values represented by np.nan (using this code), before the decoded time xr.DataArray is built.

My theory above seems to be correct based on @pochedls's assessment:

I quickly looked at xarray behavior around masked arrays.

If you try to make a dataarray from a masked array, it puts a NaN where the mask is true. So maybe that cftime issue will be fine.

import numpy as np
import xarray as xr
 
x = np.array([0, 1, 2, 3, 4, 5])
xm = np.ma.masked_where(x==2, x)
xr.DataArray(data=xm, dims=['x'])
 
# <xarray.DataArray (x: 6)> Size: 48B
# array([ 0.,  1., nan,  3.,  4.,  5.])
# Dimensions without coordinates: x

cftime=1.6.4 was released a few weeks ago and it includes PR #330 which fixes this bug on Linux.

import cftime
import numpy as np
import xarray as xr
import xcdat as xc

ds = xr.open_dataset(path1, decode_times=False)
ds["time"].attrs["axis"] = "time"
ds["time"].attrs["long_name"] = "time"
ds = ds.set_coords(("time"))
ds_decoded = xc.decode_time(ds)

print(cftime.__version__)
# '1.6.4'

Note, missing values are represented by cftime objects using the units attribute. This now happens on Linux and Mac with cftime=1.6.4, before it was just with Mac on cftime=1.6.3. If desired, theses missing values can be converted back to np.nan.

print(ds_decoded.time.encoding["units"])
# seconds since 0488-02-01 00:00:00

print(ds_decoded.time.values)
# ...,
# cftime.DatetimeNoLeap(496, 8, 8, 0, 0, 0, 0, has_year_zero=True),
# cftime.DatetimeNoLeap(496, 8, 7, 0, 0, 0, 0, has_year_zero=True),
# cftime.DatetimeNoLeap(488, 2, 1, 0, 0, 0, 0, has_year_zero=True)], <---- missing value

# Convert the cftime.datetime value 0 back to `np.nan`.
fill_value = cftime.num2date([0], ds.time.units, calendar=ds.time.calendar)
ds_decoded["time"] = ds_decoded.time.where(
    ds_decoded.time != fill_value, np.nan, drop=False
)

print(ds_decoded.time.values)
#         ...,
#         cftime.DatetimeNoLeap(496, 8, 8, 0, 0, 0, 0, has_year_zero=True),
#         cftime.DatetimeNoLeap(496, 8, 7, 0, 0, 0, 0, has_year_zero=True),
#         nan],  <----- missing value

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
Status: Done
Development

No branches or pull requests

4 participants