
example request: LogisticRegression s2s-ai-challenge #7

Closed
aaronspring opened this issue Nov 17, 2021 · 6 comments

Comments

@aaronspring

aaronspring commented Nov 17, 2021

I love the idea of xcast. If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?

I think more concrete examples in the documentation would help users, e.g. applying xcast to the s2s-ai-challenge mock data as I did in https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd and the related issue. (Your approach is likely much more performant, since vectorize=True essentially loops the xr.apply_ufunc call, but it would be great to check.)

@aaronspring
Author

aaronspring commented Nov 18, 2021

X_train
<xarray.Dataset>
Dimensions:    (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
  * lead_time  (lead_time) int64 1 2
  * year       (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
  * week       (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
  * X          (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
  * Y          (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
    t2m        (lead_time, year, week, X, Y) float64 0.3526 0.9743 ... 0.8904
    tp         (lead_time, year, week, X, Y) float64 0.7562 0.6563 ... 0.9506
    msl        (lead_time, year, week, X, Y) float64 0.7783 0.03226 ... 0.361

y_train
<xarray.Dataset>
Dimensions:    (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
  * lead_time  (lead_time) int64 1 2
  * year       (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
  * week       (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
  * X          (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
  * Y          (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
    t2m        (lead_time, year, week, X, Y) float64 1.0 2.0 1.0 ... 0.0 0.0 2.0
    tp         (lead_time, year, week, X, Y) float64 2.0 1.0 1.0 ... 1.0 1.0 2.0
    msl        (lead_time, year, week, X, Y) float64 2.0 0.0 0.0 ... 1.0 1.0 1.0

emlr = xc.eMultivariateLogisticRegression()

emlr.fit(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
         y_train.isel(lead_time=0)[['t2m']].to_array().stack(S=('year','week')),
         x_feature_dim='variable', x_sample_dim='S',
         y_sample_dim='S', y_feature_dim='variable')

emlr.predict(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
         x_feature_dim='variable', x_sample_dim='S').rename({'variable':'category'})#.unstack()

<xarray.DataArray (Y: 121, X: 240, ND: 1, S: 954, category: 3)>
array([[[[[0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],

My first usage observations:

  • I probably did something wrong: I get 1/3 everywhere, contrasting with https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd
  • is your predict like predict_proba here?
  • I find it unintuitive to set a y_feature_dim; doesn't usually only X have features?
  • I find it unintuitive to set both x_sample_dim and y_sample_dim; what about requiring a single sample_dim on both?
  • could _sample_dim also be a list, so that xcast stacks, does its stuff, and unstacks again?
  • did I get the meaning of variable in the return right as category? What's ND for?
  • you require N=4-5 dim arrays; it would be nice to lift that restriction and allow anything from N=1 upward, with dimensions being broadcast

@kjhall01
Owner

Hi Aaron,

1/ If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?

I don't know how to do it off the top of my head, but I would do something akin to np.hstack along the X axis if you want to do multiple lead times at once. That way, instead of (X, Y, M, S, L) you have (X*L, Y, M, S), and you can un-stack again afterwards to get back the L dimension, like you seem to have done with the variables of the dataset above.
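The folding idea above can be sketched with plain numpy (an illustrative sketch only, not XCast's API; the toy array and its shape are made up):

```python
import numpy as np

# Toy array shaped (lead_time, feature, sample): 2 leads, 3 features, 4 samples.
arr = np.arange(24, dtype=float).reshape(2, 3, 4)

# Fold lead_time into the feature axis: (L, M, S) -> (L*M, S),
# so a single fit sees every lead's predictors as extra features.
folded = arr.reshape(2 * 3, 4)

# ... fit/predict on `folded` here ...

# Un-fold afterwards to recover the lead_time dimension.
restored = folded.reshape(2, 3, 4)
assert np.array_equal(restored, arr)
```

The same reshape trick works along any axis, as long as you fold and un-fold with matching shapes.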

2/ probably did something wrong, 1/3 results everywhere contrasting

It looks like you haven't one-hot encoded your observations here. Like many sklearn classifiers, XCast.eMultivariateLogisticRegression expects a one-hot-encoded categorical target; for example, for a time series [AN, BN, NN, NN, BN, AN], xcast would be looking for [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]].

You can use either Y = xc.NormalTerciles().fit(Y).transform(Y), which uses a normal distribution to classify data into the tercile categories [0, 0.33], [0.34, 0.66], [0.67, 1], or xc.RankedTerciles().fit(Y).transform(Y) for a ranked approach that skips the normality assumption.

eMultivariateLogisticRegression is a simple estimator rather than a prepared MME class; you should use pMultivariateELR instead if you want all of this taken care of for you.
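For readers unfamiliar with the encoding, here is a minimal ranked-tercile one-hot sketch in numpy (purely illustrative; XCast's RankedTerciles may differ in detail, and the sample values are made up):

```python
import numpy as np

def one_hot_terciles(y):
    """Rank-based terciles: assign each sample to below/near/above normal
    and one-hot encode the result as columns [BN, NN, AN]."""
    low, high = np.percentile(y, [100 / 3, 200 / 3])
    category = np.digitize(y, [low, high])  # 0 = BN, 1 = NN, 2 = AN
    return np.eye(3)[category]

y = np.array([-1.5, -0.2, 0.0, 0.1, -0.3, 1.4])
encoded = one_hot_terciles(y)
# Each row sums to 1 and picks exactly one of the three categories.
assert encoded.shape == (6, 3)
assert np.all(encoded.sum(axis=1) == 1)
```

Because it thresholds on empirical percentiles rather than a fitted normal distribution, this corresponds to the "ranked" approach mentioned above.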

3/ is your predict like predict_proba here?

Yes

4/ I find it unintuitive to set a y_feature_dim, doesn't usually only X have features?

In probabilistic machine learning it is common for targets to be one-hot encoded, as described above, so Y does have features in that case.

5/ I find it unintuitive to set a x_sample_dim and y_sample_dim, what about just sample_dim to be required on both?

The initial motivation was to make it flexible to use with different NetCDF data with minimal preprocessing, but I do see your point. I would like to implement something that tries to guess common dimension names (e.g. X, LON, Lon, Longitude, and long all generally mean the same thing).
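A name-guessing helper of that sort might look like the following (a hypothetical sketch; the alias table and function are invented for illustration, not part of XCast):

```python
# Hypothetical alias table: common coordinate spellings mapped to canonical roles.
ALIASES = {
    'longitude': ['x', 'lon', 'long', 'longitude'],
    'latitude': ['y', 'lat', 'latitude'],
    'sample': ['s', 'time', 't', 'year'],
    'feature': ['m', 'variable', 'member', 'feature'],
}

def guess_dim(name):
    """Return the canonical role for a dimension name, or None if unknown."""
    lowered = name.lower()
    for role, spellings in ALIASES.items():
        if lowered in spellings:
            return role
    return None

assert guess_dim('Lon') == 'longitude'
assert guess_dim('variable') == 'feature'
assert guess_dim('banana') is None
```

Explicit keyword arguments could then remain as overrides for datasets whose names aren't in the table.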

6/ could _sample_dim also be a list and xcast stacks, does its stuff and unstacks again?

It could in theory, but I don't have plans to implement this any time soon.
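With xarray, such a convenience is already a two-liner around any fit/predict call; a sketch of what it could do internally (hedged: this is not XCast code, just the stock xarray stack/unstack round trip on a made-up array):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=('year', 'week', 'X'),
)

# Stack the requested sample dims into one 'S' dimension ...
stacked = da.stack(S=('year', 'week'))
assert stacked.sizes['S'] == 12

# ... run fit/predict on `stacked` here ...

# ... then unstack to restore the original layout.
restored = stacked.unstack('S').transpose('year', 'week', 'X')
assert dict(restored.sizes) == dict(da.sizes)
```

A wrapper accepting sample_dim as a list would just insert its estimator call between the stack and unstack steps.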

7/ Did I get the meaning of variable in the return right as category? What's ND for?

Yes, you are right about the category dimension. ND is used for "Non-Deterministic" models, where things are randomly initialized and require many parallel runs to get reproducible/meaningful results. Only some classes implement non-determinism, so the ones that don't end up with ND=1.

8/ you require N=4-5 dim arrays; it would be nice to lift that restriction and allow anything from N=1 upward, with dimensions being broadcast

This would be nice, and potentially it can be added some day, but it can also result in .fit calls that take hours upon hours.

Generally, I am currently working on re-doing the documentation and making more examples and walkthrough guides.

@aaronspring
Author

Thanks for the reply. Looking forward to an eLR example; I couldn't get it running even with one-hot encoding.

I understand 4/ and 5/ better now.

@kjhall01
Owner

I'll provide a similar example tomorrow; in the meantime, I'd suggest dropping the coordinates on your inputs that aren't dimensions.
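Dropping non-dimension coordinates is a generic xarray recipe (the toy dataset and the valid_time coordinate below are made up for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'t2m': (('year', 'X'), np.random.rand(3, 4))},
    coords={
        'year': [2000, 2001, 2002],
        'X': [0, 1, 2, 3],
        # A non-dimension coordinate of the kind that can confuse downstream code.
        'valid_time': ('year', [10, 11, 12]),
    },
)

# Keep only coordinates that are also dimensions; drop the rest.
extra = [c for c in ds.coords if c not in ds.dims]
clean = ds.drop_vars(extra)
assert set(clean.coords) == {'year', 'X'}
```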

@kjhall01
Owner

kjhall01 commented Apr 4, 2022

Hey Aaron,

Sorry for the delay on this; work is hectic. I'm going to close this issue because I'm currently re-doing all of the documentation and examples anyway.
I'll comment when I have an example doing probabilistic forecasting.

@kjhall01 kjhall01 closed this as completed Apr 4, 2022
@kjhall01
Owner

kjhall01 commented Apr 4, 2022

kjhall01.github.io/xcast
