Enable custom statistics to return multiple results #3904

Open
berndbecker opened this issue Oct 7, 2020 · 14 comments
Labels
Feature: Statistics Label for reduction-like operations e.g., collapsing, aggregrating, rolling-window Peloton 🚴‍♂️ Target a breakaway issue to be caught and closed by the peloton
Projects

Comments

@berndbecker

berndbecker commented Oct 7, 2020

✨ Feature Request

Allow a custom statistic to return a tuple rather than a scalar.
Mission: at each grid point, store a vector of threshold exceedance counts for increasing durations,
in lieu of the time dimension (the vector is much shorter than the time series).

In the example

https://scitools-iris.readthedocs.io/en/stable/generated/gallery/general/plot_custom_aggregation.html
a single number is returned at each gridpoint. I am after functionality that returns more than one value for each grid point.

Motivation

Not sure if this is an issue, but I have colleagues who calculated threshold exceedance durations at great pains. Feedback on my request from an AVD surgery also pointed to heightened frustration at how complicated "this" is. By "this" I mean
doing something on a time series stored at each grid point (a 3-D cube) and retaining a set of numbers rather than collapsing the time dimension to just one number (max, min, mean).

I'm always frustrated when something is almost doable but does not quite work, and
you have to go all the way back and do it with a sledgehammer.

Additional context

I need a push to understand custom statistics better.

In the attached example
( run with module load scitools/experimental-current,
python /net/home/h02/frtm/prog/wcssp/wcssp5/scripts/ts_exceedance.py)

I am compiling a threshold exceedance duration, or survival function,
for rainfall time series: asking how many rainy periods were longer than 1, 2, ..., 5 and so on days.
This works as a demonstrator on a single time series.
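
For readers without access to the attached script, the single-series calculation might look roughly like the sketch below. It is not the script referenced above. The function names, the sample rainfall values, the 1.0 mm "rainy day" threshold, and the choice to count spells lasting at least d days are all assumptions made for illustration.

```python
from itertools import groupby


def spell_lengths(series, threshold):
    """Lengths of consecutive runs where values exceed `threshold`."""
    wet = [value > threshold for value in series]
    return [sum(1 for _ in run) for is_wet, run in groupby(wet) if is_wet]


def exceedance_survival(series, threshold, max_duration):
    """Count spells lasting at least d steps, for d = 1..max_duration."""
    lengths = spell_lengths(series, threshold)
    return [sum(1 for n in lengths if n >= d) for d in range(1, max_duration + 1)]


# Hypothetical daily rainfall (mm); a day counts as rainy above 1.0 mm.
rain = [0.0, 2.5, 3.1, 0.0, 4.0, 0.2, 1.5, 2.0, 6.3, 0.0]
survivors = exceedance_survival(rain, threshold=1.0, max_duration=5)
print(survivors)  # [3, 2, 1, 0, 0]
```

The result is a vector whose length is fixed by `max_duration` rather than by the length of the time series, which is the property the request relies on.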

Next I would like to run the same custom statistic at each grid point as in
https://scitools.org.uk/iris/docs/latest/examples/General/custom_aggregation.html#general-custom-aggregation

But I struggle to understand the shape of the data being passed to the aggregator: what should axis be?
And I have no idea how to store the survivors vector over the time series dimension.

But I am convinced it is not really that difficult.


@rcomer
Member

rcomer commented Oct 7, 2020

Possibly related: #3810 #3331

@rcomer
Member

rcomer commented Oct 8, 2020

So, if I've understood, you start with a cube that is (time, latitude, longitude), and you want to end up with a cube that is (durations, latitude, longitude), having done your calculation over time at each grid point. The problem is that the standard iris Aggregator class is designed to reduce the dimensionality down to just (latitude, longitude) when used with collapsed.
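
The (time, latitude, longitude) to (durations, latitude, longitude) change of shape can be illustrated outside Iris with plain NumPy. The sketch below does not use the Iris Aggregator API. The function `survival_counts`, the random data, and the threshold are hypothetical. It only shows that `axis` is the position of the time dimension in the array, and that a function returning a vector replaces that dimension with a new one.

```python
import numpy as np


def survival_counts(series, threshold=1.0, max_duration=3):
    """Number of spells above `threshold` lasting >= d steps, for d = 1..max_duration."""
    # Pad with False so every spell has a detectable start and end.
    wet = np.concatenate(([False], series > threshold, [False]))
    edges = np.flatnonzero(np.diff(wet.astype(int)))
    lengths = edges[1::2] - edges[::2]  # spell ends minus spell starts
    return np.array([(lengths >= d).sum() for d in range(1, max_duration + 1)])


# Hypothetical data with dimensions (time, latitude, longitude).
rng = np.random.default_rng(0)
data = rng.gamma(shape=0.5, scale=4.0, size=(30, 2, 3))

time_axis = 0  # position of the time dimension in `data`
result = np.apply_along_axis(survival_counts, time_axis, data)
print(data.shape, "->", result.shape)  # (30, 2, 3) -> (3, 2, 3)
```

Here the 30 time steps are replaced by 3 duration bins at every grid point. A real implementation would also need a coordinate describing the new dimension.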

We do have the PercentileAggregator class, which has the capacity to add a "percent" dimension if you want to calculate more than one percentile. So we know that it is possible to add dimensions. That class is hard-coded to calculate percentiles though so, if you wanted to make use of it to calculate some other dimension-adding statistic, I think you'd need to subclass it. It also isn't even listed in the docs.

So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions.

@rcomer
Member

rcomer commented Oct 8, 2020

Having said that, this particular statistic presumably needs information from the time coordinate. I think all the existing aggregation calculations only use the cube data. 🤔

@berndbecker
Author

The threshold exceedance duration can do without information from the time coordinate for the time being. The PercentileAggregator would deliver what I expected, for starters, to be an easy operation. For a later generalization, a more complex combination of metadata is a possibility, but that can wait.

@berndbecker
Author

Perhaps it is easier if the shape of the tuple to be returned is set at the beginning. That is, it could be the list of linear regression coefficients, or the first 4 moments of a normal distribution, or the list of percentiles as in the PercentileAggregator, or a list of durations in time units.

@bjlittle
Member

@rcomer Fancy taking this on?

@bjlittle bjlittle added the Peloton 🚴‍♂️ Target a breakaway issue to be caught and closed by the peloton label Oct 28, 2020
@bjlittle bjlittle added this to To do in Peloton via automation Oct 28, 2020
@rcomer
Member

rcomer commented Oct 28, 2020

Hey @bjlittle, sorry I think I'd struggle to justify time on this one. My PRs generally fall into two categories:

  • it directly affects my (or someone in my group's) work
  • it's small enough to do "in the margins", so I don't need to justify the time

While this one doesn't look huge, it looks like more than a 5 min job.

@rcomer
Member

rcomer commented Aug 23, 2021

So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions.

While digging to find something else, I noticed that PercentileAggregator was in fact originally written as AdditiveAggregator but was changed "after review discussion" as part of #1569. So there were reasons to make it specific, but I can't see from that PR what the reasons were.

Here be dragons.

@pp-mo pp-mo self-assigned this Aug 25, 2021
@rcomer
Member

rcomer commented Sep 22, 2021

Note that #3901 also makes changes to the percentile aggregator, so it may be better to wait until that is resolved before starting work on this. Otherwise we could create some nasty code conflicts.

@trexfeathers
Contributor

trexfeathers commented Apr 6, 2022

Hi @berndbecker, sorry for the delay on this - it's both difficult and slightly niche! Is it still something you'd be interested in seeing in Iris?

If you think others would also be interested, we encourage you and them to try out the new voting feature.

@berndbecker
Author

berndbecker commented Apr 6, 2022 via email

@rcomer
Member

rcomer commented Apr 6, 2022

@wjbenfold has #4676 to implement an aggregator for number of days of data matching certain criteria (e.g. above a threshold), which I think addresses that Yammer thread. However, it would only handle a single threshold value at a time I think.

@wjbenfold
Contributor

I'm currently intending that it can handle being between two thresholds (or any other criterion you can write as a lambda) but only one condition at a time, yes

@pp-mo pp-mo added the Feature: Statistics Label for reduction-like operations e.g., collapsing, aggregrating, rolling-window label Sep 7, 2022
@pp-mo pp-mo changed the title custom statistic to return a tuple rather than a scalar Allow custom statistics to return multiple results Nov 16, 2022
@pp-mo
Member

pp-mo commented Nov 16, 2022

I just changed the title to something a bit more general.
I actually think there are two different possibilities here for extending the capabilities :

  • firstly, a calculation statistic that returns multiple statistical components
    • in these cases, the cube method (collapsed/aggregated_by/rolling_window) would naturally return multiple cubes instead of one
    • a classic example would be a linear regression operator, which computes "slope" + "intercept" values together
  • secondly, a statistical operation repeated over multiple thresholds, categories, etc
    • in these cases, the result would have an extra dimension -- e.g. threshold, category, histogram-bin
    • as an example, we already have the PERCENTILE operator.
      But we don't have an easy way of creating a custom statistic of this sort.
    • a relevant example that came up lately : calculating frequency of occurrence (over a time period) from category values (over time + locations)
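
To make the first case concrete, the sketch below shows a statistic that naturally returns two components at once. It is plain Python on a single series, not Iris code. The function name `linear_fit` and the sample values are assumptions for illustration.

```python
def linear_fit(times, values):
    """Ordinary least-squares fit; returns (slope, intercept) together."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope, intercept


# Hypothetical series lying exactly on the line v = 2 t + 1.
times = [0, 1, 2, 3, 4]
values = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = linear_fit(times, values)
print(slope, intercept)  # 2.0 1.0
```

The two outputs have different meanings and units, which is why separate result cubes, rather than one extra dimension, would suit this case.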

From an efficiency point of view, it is always possible to make multiple statistical cubes, and use the CubeList.realise_data method to efficiently calculate multiple statistics over the same data.
Also, the 'extra dimension' cases can be constructed by creating multiple statistical result cubes, adding a defining scalar coord to each, and merging them into one.
But obviously, from a simplicity and convenience point of view, this can be improved!
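
The "one result per threshold, then combine" workaround can be sketched with plain NumPy arrays standing in for cubes. The data, the threshold values, and the variable names are hypothetical. In Iris the equivalent steps would be one collapsed cube per threshold, a scalar coordinate on each, and a merge, as described above.

```python
import numpy as np

# Hypothetical data with dimensions (time, latitude, longitude).
rng = np.random.default_rng(1)
data = rng.gamma(shape=0.5, scale=4.0, size=(30, 2, 3))
thresholds = [1.0, 5.0, 10.0]

# One collapsed result per threshold: count of time steps above it.
per_threshold = [(data > t).sum(axis=0) for t in thresholds]

# Combine the separate results along a new leading "threshold" dimension.
stacked = np.stack(per_threshold, axis=0)
print(stacked.shape)  # (3, 2, 3)
```

Each threshold requires its own pass over the data and its own bookkeeping, which is the inconvenience a dimension-adding aggregator would remove.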

@pp-mo pp-mo changed the title Allow custom statistics to return multiple results Enable custom statistics to return multiple results Nov 16, 2022
Projects
Peloton
Backlog
Status: No status
Development

No branches or pull requests

6 participants