Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Out of bounds timestamp for temporal averaging #301

Closed
pochedls opened this issue Aug 4, 2022 · 1 comment · Fixed by #302
Closed

[Bug]: Out of bounds timestamp for temporal averaging #301

pochedls opened this issue Aug 4, 2022 · 1 comment · Fixed by #302
Assignees
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@pochedls
Copy link
Collaborator

pochedls commented Aug 4, 2022

What happened?

In trying to compute annual averages for data with non-cf-compliant time units I get an OutOfBoundsError.

What did you expect to happen?

I think this functionality should work with non-cf-compliant time.

Minimal Complete Verifiable Example

import xcdat

fn = '/p/user_pub/climate_work/pochedley1/cmip6_msu/ttt_Amon_ACCESS-ESM1-5_piControl_r1i1p1f1_gr_10101-100012.nc'
ds = xcdat.open_dataset(fn)
ds_yearly = ds.temporal.group_average('tf2', freq='year', weighted=True)

Relevant log output

.
.
.

File ~/code/xcdat/xcdat/temporal.py:1358, in TemporalAccessor._convert_df_to_dt(self, df)
   1351     dates = [
   1352         cftime.datetime(year, month, day, hour)
   1353         for year, month, day, hour in zip(
   1354             df_new.year, df_new.month, df_new.day, df_new.hour
   1355         )
   1356     ]
   1357 else:
-> 1358     dates = pd.to_datetime(df_new)
   1360 return np.array(dates)

.
.
.

ValueError: cannot assemble the datetimes: Out of bounds nanosecond timestamp: 1010-10-01 00:00:00

Anything else we need to know?

This issue does not occur with original data (with cf-compliant datetimes).

The condensed log above shows that this issue arises in the _convert_df_to_dt function when the dataframe is cast to datetime objects using pandas (because year_is_unused = False). year_is_unused is set to False because the mode is set to average and the _freq is not yearly. I don't totally understand why we use cftime for this step for some conditions, but use pandas for others.

Can we use cftime for all cases? I'm wondering if we started by using pandas and then used cftime for some conditions (e.g., see the documentation), but in reality we may be able to simply use cftime. @tomvothecoder – do you remember these details well enough to know?

If I remove the if year_is_unused conditional and re-run the code I don't get an error. But I fail a bunch of unit tests – I'm not sure if this is a unit issue or an actual problem.

Also note that it is possible that this issue affects other functions/functionality (but using cftime might resolve those potential issues).

Environment

main branch

@pochedls pochedls added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Aug 4, 2022
@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Aug 4, 2022

Thanks for raising this issue @pochedls. I'll be able to take a closer look next week, but it looks like you're beating me to it.

Here's that method's docstring to explain when cftime or pd.to_datetime() is used:

xcdat/xcdat/temporal.py

Lines 1308 to 1318 in 050d57e

def _convert_df_to_dt(self, df: pd.DataFrame) -> np.ndarray:
"""Converts a DataFrame of datetime components to datetime objects.
datetime objects require at least a year, month, and day value. However,
some modes and time frequencies don't require year, month, and/or day
for grouping. For these cases, use default values of 1 in order to
meet this datetime requirement.
If the default value of 1 is used for the years, datetime objects
must be created using `cftime.datetime` because year 1 is outside the
Timestamp-valid range.

Basically, I didn't take into account a situation where you would compute averages for datasets outside the pd.Timestamp range (e.g., start year 1010).

Can we use cftime for all cases? I'm wondering if we started by using pandas and then used cftime for some conditions (e.g., see the documentation), but in reality we may be able to simply use cftime. @tomvothecoder – do you remember these details well enough to know?

Yes, I think using cftime is the safest thing to do here because it accommodates datasets outside the pd.Timestamp range.

If I remove the if year_is_unused conditional and re-run the code I don't get an error. But I fail a bunch of unit tests – I'm not sure if this is a unit issue or an actual problem.

Also note that it is possible that this issue affects other functions/functionality (but using cftime might resolve those potential issues).

Yes, the solution is to probably remove the year_is_unused conditional altogether and fix the tests (they probably expect datetime objects and not cftime). Also, we should derive the correct cftime object based on the calendar type, similar to decode_non_cf_time().

pochedls added a commit that referenced this issue Aug 4, 2022
tomvothecoder pushed a commit that referenced this issue Aug 4, 2022
tomvothecoder pushed a commit that referenced this issue Aug 10, 2022
pochedls added a commit that referenced this issue Aug 10, 2022
* attempted fix for #301: use cftime for temporal operations

* Update temporal accessor to use specific cftime objects
- Update format of `decode_non_cf_time()` conditional statement
- Update `time_cf` with encoding dict attributes (xarray behavior)
- Update `test_dataset.py` and `test_temporal.py` with specific `cftime` date_type objects
- Add `xesmf` to `dev.yml`

Co-authored-by: Tom Vo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants