Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Add functionality to accurately set time bounds #412

Closed
pochedls opened this issue Jan 26, 2023 · 19 comments · Fixed by #418
Closed

[Feature]: Add functionality to accurately set time bounds #412

pochedls opened this issue Jan 26, 2023 · 19 comments · Fixed by #418
Assignees
Labels
type: enhancement New enhancement request

Comments

@pochedls
Copy link
Collaborator

Is your feature request related to a problem?

Following Discussion #411 – xCDAT should modify how it handles bounds in general and time bounds in particular. xCDAT currently uses the bounds within the dataset, but if they are not present it will generate bounds for each axis (by assuming that the bounds are the midpoint between each axis point). This is problematic for time coordinates because a) time points are frequently not saved out at the midpoint value and b) months are not equal in size so the halfway point between two months do not yield the correct bound.

This is illustrated in Slide 4 of climate_specific_tools.pdf.

For example, ERA5 monthly mean data is saved at midnight on the first day of the month:

import xcdat as xc
fn = '/p/user_pub/climate_work/pochedley1/ecmwf/era5/new/tas_era5.nc'
ds = xc.open_dataset(fn)
ds.time[0]

...
array(cftime.DatetimeGregorian(1979, 1, 1, 0, 0, 0, 0, has_year_zero=False),
dtype=object)
...

xCDAT then autogenerates incorrect bounds:

print(ds.bounds.get_bounds('T')[0])

...
array([cftime.DatetimeGregorian(1978, 12, 16, 12, 0, 0, 0, has_year_zero=False),
cftime.DatetimeGregorian(1979, 1, 16, 12, 0, 0, 0, has_year_zero=False)],
dtype=object)
...
xcdat_bounds: True

Note that this issue also applies to vertical coordinate which are frequently not equally spaced.

Describe the solution you'd like

Addressing this issue would take a few changes:

  • By default, auto-generate missing bounds for latitude and longitude only (i.e., do not automatically set bounds for vertical and temporal coordinates). This would require modification of xcdat.open_dataset() (in function and documentation).
  • Create functions to create sub-daily, daily, and monthly time bounds, e.g., setAxisTimeBoundsMonthly() that set the time bounds at the start and end of the period (e.g., for the ERA5 case above, the bounds should be 1979-01-01 00:00:00 and 1979-02-01 00:00:00).
  • Consider allowing the user to set the bounds that xcdat creates
    • This could be a) an option in xcdat.open_dataset or b) a global option as in CDAT's setAutoBounds() function: Allow the user to determine which of t, x, y, z should have bounds creation, e.g., gen_bounds = None | List['X', 'Y', 'T', 'Z']
    • If this were implemented we'd likely need some way of inferring what function to use when setting the time bounds (as in isMonthly())

Describe alternatives you've considered

I think that the minimal solution would be to disable temporal and vertical bounds generation by default on dataset open (users could still call xCDAT's functions to generate bounds at their own risk).

Additional context

I think the above changes would largely address the issue.

I did think about an edge case that I think would be mishandled by related tempoeral functions (e.g., xcdat.temporal.group_average()). Consider a dataset with bounds (so xCDAT would not generate bounds) that are weird (but correct):

Center Point January 1, 2022 00:00:00 February 1, 2022 00:00:00 March 1, 2022 00:00:00 ...
Bounds December 15, 2022 00:00:00 January 16, 2022 00:00:00 January 16, 2022 00:00:00 February 16, 2022 00:00:00 February 16, 2022 00:00:00 March 16, 2022 00:00:00 ... ...

If we were to take an annual average of this data (for 2022), xCDAT would first figure out the weights by taking the difference of the upper and lower bounds (31, 31, 28, ...) and use this vector as the weights in the annual average. But some of the data should not be included in the annual average; according to the bounds ~half of the January data is supposedly in December (and the weight should be 15 days for the annual average in 2022).

I don't think this would be an issue for the average function, but probably does affect climatology and departures.

I think this is likely an edge case and I'm not sure what CDAT does in this situation, but it came to mind in thinking about this issue.

@pochedls pochedls added the type: enhancement New enhancement request label Jan 26, 2023
@lee1043
Copy link
Collaborator

lee1043 commented Jan 26, 2023

I am glad @pochedls brought this subject for the team's attention (thanks, @pochedls!). In PMP workflow I have used cdutil.setTimeBoundsDaily to ensure monthly time series of models to have correct monthly bounds because some models have their time point in the middle of month while others have beginning of month. Use case for cdutil.setTimeBoundsDaily is described in this document (search for setTimeBoundsDaily). I was hoping we can have the similar capability in xCDAT. I second that @pochedls suggests.

Below could be somewhat related (thanks @taylor13 for pointing this to us)
https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#climatological-seasons-ex

@taylor13
Copy link

I would note that even for latitude/longitude grids, creating bounds half-way between coordinate values is not always correct. For gaussian grids and other latxlon grids that are not equally spaced, you would get the wrong values, in general, for the bounds. I would only auto-generate bounds after determining whether a coordinate represented equal intervals (of whatever it measures, be it latitude, time, vertical position, or whatever).

@pochedls
Copy link
Collaborator Author

I would note that even for latitude/longitude grids, creating bounds half-way between coordinate values is not always correct. For gaussian grids and other latxlon grids that are not equally spaced, you would get the wrong values, in general, for the bounds. I would only auto-generate bounds after determining whether a coordinate represented equal intervals (of whatever it measures, be it latitude, time, vertical position, or whatever).

Thanks for reviewing this @taylor13. If a model doesn't provide bounds and a user needed to do an operation with bounds, I think most people would still want to use the generated bounds even if they were imperfect. Do you think a warning would be appropriate in this situation? The user could then decide if they wanted to not use the dataset (or email the modeling center to ask for the bounds?).

@taylor13
Copy link

Yes, I would certainly include a warning, and for most cases one might get away with inaccurate bounds, so generating them is probably what most users will want.

@chengzhuzhang
Copy link
Collaborator

I appreciate this discussion. Personally, for time dimension, i always thought time bounds are more reliable than the time points, because different data centers store their data at different points of the range (beginning, middle, or end). When the time bounds are missing, to generate the accurate bounds, one needs to know exactly how the data is recorded. I agree that we should provide option for these users (by providing similar function provided by setTimeBoundsFREQ).

However, I was not aware of how to properly use setTimeBoundsFREQ until reading the discussion here #411.

If looping over multiple models, using setTimeBoundsMonthly(s) without specifying the second argument, would assume data recorded at beginning of the time interval for all models, which could result inaccurate bounds if otherwise.

If the data users don't know how the time is recorded, I think to provide an educated guess with a warning is still valuable.

@lee1043
Copy link
Collaborator

lee1043 commented Jan 27, 2023

If looping over multiple models, using setTimeBoundsMonthly(s) without specifying the second argument, would assume data recorded at beginning of the time interval for all models, which could result inaccurate bounds if otherwise.

@chengzhuzhang Regarding setTimeBoundsMonthly in my understanding default second argument is 0, which set time bounds to beginning and end of month, by default. I think the argument option for data recorded at end of month was added to prevent occasions such for example data recorded at 2/29 but somehow with combined with incorrect calendar or operation it could be recognized as 3/1. So the option is to prevent such case (see here). I think such occasion might be less common with current xarray/xcdat data handling.

@chengzhuzhang
Copy link
Collaborator

Hi @lee1043 thank you for pointing to the code! I guess I was confused by the documentation as follows...

>>> import cdutil, cdms
>>> f = cdms.open('some_monthly_data.nc')
>>> x = f('var')
>>> cdutil.setTimeBoundsMonthly(x)
# The default action of the setTimeBoundsMonthly function assumes that the
time point is at the beginning of the month. To compute bounds assuming that
the time point at the end of the month,
>>> cdutil.setTimeBoundsMonthly(x, 1)

The native E3SM output records time at the end of time intervals, luckily it provides time bounds and xcdat can open the datasets and correct it with time centered.

@jypeter
Copy link

jypeter commented Jan 27, 2023

Thanks @pochedls for generating this feature request!

The user should definitely get warnings when missing bounds are automatically generated at midpoint. The midpoint option is probably a better than nothing poor-man option.

I'm not sure other options than midpoint would make sense. If somebody is going to rewrite/improve some functions, you may want to get rid of the width parameter that appears in

I have tried to follow the source links in xcdat.bounds.BoundsAccessor, but I was quickly lost. I did find a reference to iris and a guess_bounds(bound_position=0.5) function. As a parameter name, bound_position sounds more like a fraction and makes more sense than width, but any other value than 0.5 (i.e midpoint) would not make much sense anyway.

So you might as well get rid of this width parameter, simplify some source code, and simplify the documentation (by only talking about midpoint)!

Would it be possible to have a bounds.add_custom_bounds function? This is probably already possible in xarray, but a nice wrapper provided by xcdat would be nice for the climate community end user. And the wrapper documentation could tell the user what type of xarray compatible parameters you have to pass, and possibly perform some basic sanity checks (e.g. axes values are in the bounds' range, bounds are monotonically increasing or decreasing, this kind of stuff)

@taylor13
Copy link

Regarding: "width (float, optional) – Width of the bounds relative to the position of the nearest points, by default 0.5."

I agree with @jypeter that rarely would one specify anything but 0.5, but since that's the default, I don't see any harm retaining it. I think the documentation should make clear how the bounds are generated when data are unequally spaced. I don't know what the algorithm currently does, but when width=0.5, I would place the cell bound half-way between a cell's coordinate value and the next cell's coordinate value. For the first and last cells (where there is no "next cell"), I would place the bound such that the coordinate value lies half-way between its bounds (i.e., the bound would be located at the same distance from the coordinate value as the other bound). If that approach is adopted, then maybe the documentation would be clearer saying:

"width (float, optional) indicates the location of a grid cell bound expressed as a fraction of the distance from a cell's coordinate value to the coordinate value of the neighboring cell. Only when width=0.5 will the cells be contiguously located. The bound for a cell with no neighbor (i.e., the first and last cell of a dimension) will be assumed to be the same distance from the cell's coordinate value as its other bound."

@pochedls
Copy link
Collaborator Author

Regarding the start / end of the time period. In thinking about how to implement this, I think by default we would just take the start/end of the month that the time point is in (e.g., 2023-01-01 or 2023-01-15 or 2023-01-31 would all map to the same bounds: [2023-01-01 00:00:00, 2023-02-01 00:00:00]). This is easy to do, because a given time point, t, is stored as a cfTime object and you can directly retrieve the month / year (e.g., t.month would yield 1 in each case) and easily define the bounds from that information.

I think the case where you would want to use stored=1 is when the time point is stored at the end of the time period. For E3SM this would be 2023-02-01 00:00:00 (for January data) and calling t.month would give you 2 (and you don't want to set the start / end bounds to [2023-02-01 00:00:00, and 2023-02-28 00:00:00]). So if E3SM didn't have time bounds, I think we'd want that stored=1 value.

@taylor13
Copy link

@pochedls, I was addressing the behavior and documentation of Dataset.bounds.add_bounds(axis, width=0.5), not cdutil.setTimeBoundsMonthly(x). I agree that for monthly data you need a special function along the lines you suggest.

@jypeter
Copy link

jypeter commented Jan 30, 2023

I have tried to figure out how the width parameter works, but I'm not sure I really got it. It is actually used in _create_bounds(axis, coord_var, width)

Good luck if somebody wants to explain how bounds are actually created in a docstring! xcdat is still in version 0.nnn so I'm rather in favor of dropping the width=0.5 default parameter (btw, the code does not check that width is actually in the 0 to 1 range).

Maybe somebody could want to use 0 or 1, if they wanted the upper and lower bounds to have the same values as the coordinates, but then I assume they would really know what they are doing, and I'd rather have them use a bounds.add_custom_bounds function

I'd rather have a clear (i.e not confusing) documentation saying that if there are no existing bounds, a warning will be raised and the bounds will be created at the midpoint (no other option). The documentation can possibly add that the created bounds of the Y axis will be clipped in the -90 to 90 range if the Y axis has no units at all, or if the units are in degrees

@pochedls
Copy link
Collaborator Author

I think we should probably remove the width parameter unless someone wants to figure out when someone would set this to width ≠ 0.5 and then add more documentation (and perhaps a check to make sure width is within [0, 1]). Is anyone able to put together a pull request to update this? I guess this could be tacked onto this issue when someone tackles this feature request (which is related to setting time bounds correctly).

@taylor13
Copy link

I agree. I misinterpreted what "width" meant. Looking at the code, it calculates a fractional distance of "width" for one bound and "1-width" for the other bound. This ensures that all cells are contiguous. As J-Y @jypeter says, I'd rather have them custom create their own bounds.

@arfriedman
Copy link

Hi @pochedls @tomvothecoder

I linked to this discussion in response to a question about time bounds on the xarray forum in case you would like to add anything (I couldn't find how to tag you directly): pydata/xarray#7580

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Mar 8, 2023

We appreciate the xCDAT plug @arfriedman!

You brought up a great idea for us to stay active in the xarray discussions forum to be more involved in the community and also help xCDAT grow.

@pochedls
Copy link
Collaborator Author

pochedls commented Mar 8, 2023

@arfriedman – just an FYI that there is a PR under review to address the contents of this feature request. This matters for temporal averaging if your dataset does not have time bounds and you need to create them (if the dataset has time bounds they will be used for temporal averaging operations).

@tomvothecoder
Copy link
Collaborator

@pochedls yup Andrew is on top of it! He even linked to that issue in his comment.

@tomvothecoder
Copy link
Collaborator

@pochedls yup Andrew is on top of it! He even linked to that issue in his comment.

@pochedls Sorry got confused with your comment about the PR for this issue and not this issue itself.

@tomvothecoder tomvothecoder moved this from To do to In progress in Next Release (v0.5.0) Mar 20, 2023
@tomvothecoder tomvothecoder removed this from In progress in Next Release (v0.5.0) Mar 29, 2023
@tomvothecoder tomvothecoder added this to To do in Next Release (v0.6.0) via automation Mar 29, 2023
@tomvothecoder tomvothecoder moved this from To do to In progress in Next Release (v0.6.0) Mar 29, 2023
Next Release (v0.6.0) automation moved this from In progress to Done Apr 20, 2023
@tomvothecoder tomvothecoder added this to the v0.6.0 milestone Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New enhancement request
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

7 participants