
example request: LogisticRegression s2s-ai-challenge #7

Closed
aaronspring opened this issue Nov 17, 2021 · 6 comments

Comments

@aaronspring

aaronspring commented Nov 17, 2021

I love the idea of xcast. If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?

I think more concrete examples in the documentation would help users, e.g. applying xcast to the s2s-ai-challenge mock data as I did in https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd and the related issue. (Your approach is likely much more performant, since vectorize=True essentially loops the xr.apply_ufunc call, but it would be great to check.)

@aaronspring
Author

aaronspring commented Nov 18, 2021

X_train
<xarray.Dataset>
Dimensions:    (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
  * lead_time  (lead_time) int64 1 2
  * year       (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
  * week       (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
  * X          (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
  * Y          (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
    t2m        (lead_time, year, week, X, Y) float64 0.3526 0.9743 ... 0.8904
    tp         (lead_time, year, week, X, Y) float64 0.7562 0.6563 ... 0.9506
    msl        (lead_time, year, week, X, Y) float64 0.7783 0.03226 ... 0.361

y_train
<xarray.Dataset>
Dimensions:    (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
  * lead_time  (lead_time) int64 1 2
  * year       (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
  * week       (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
  * X          (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
  * Y          (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
    t2m        (lead_time, year, week, X, Y) float64 1.0 2.0 1.0 ... 0.0 0.0 2.0
    tp         (lead_time, year, week, X, Y) float64 2.0 1.0 1.0 ... 1.0 1.0 2.0
    msl        (lead_time, year, week, X, Y) float64 2.0 0.0 0.0 ... 1.0 1.0 1.0

emlr = xc.eMultivariateLogisticRegression()

emlr.fit(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
         y_train.isel(lead_time=0)[['t2m']].to_array().stack(S=('year','week')),
         x_feature_dim='variable', x_sample_dim='S',
         y_sample_dim='S', y_feature_dim='variable')

emlr.predict(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
         x_feature_dim='variable', x_sample_dim='S').rename({'variable':'category'})#.unstack()

<xarray.DataArray (Y: 121, X: 240, ND: 1, S: 954, category: 3)>
array([[[[[0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],

My first usage observations:

  • I probably did something wrong: I get 1/3 everywhere, contrasting with https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd
  • is your predict like predict_proba here?
  • I find it unintuitive to set a y_feature_dim; doesn't usually only X have features?
  • I find it unintuitive to set both x_sample_dim and y_sample_dim; what about requiring a single sample_dim on both?
  • could _sample_dim also be a list, so that xcast stacks, does its stuff, and unstacks again?
  • did I get the meaning of variable in the return right as category? What's ND for?
  • you require N=4-5 dim arrays; it would be nice to lift that restriction and allow anything from N=1 upward, with dimensions being broadcast

@kjhall01
Owner

Hi Aaron,

1/ If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?

I don't know how to do it off the top of my head, but I would do something akin to np.hstack along the X axis if you want to do multiple lead times at once. That way, instead of (X, Y, M, S, L) you have (X*L, Y, M, S), and you can un-stack again afterwards to get back the L dimension, like you seem to have done with the variables of the dataset above.
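The folding idea above can be sketched with plain numpy (an illustrative sketch only, not XCast's API; the toy array and its shape are made up):

```python
import numpy as np

# Toy array shaped (lead_time, feature, sample): 2 leads, 3 features, 4 samples.
arr = np.arange(24, dtype=float).reshape(2, 3, 4)

# Fold lead_time into the feature axis: (L, M, S) -> (L*M, S),
# so a single fit sees every lead's predictors as extra features.
folded = arr.reshape(2 * 3, 4)

# ... fit/predict on `folded` here ...

# Un-fold afterwards to recover the lead_time dimension.
restored = folded.reshape(2, 3, 4)
assert np.array_equal(restored, arr)
```

The same reshape trick works along any axis, as long as you fold and un-fold with matching shapes.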

2/ probably did something wrong, 1/3 results everywhere contrasting

It looks like you haven't one-hot encoded your observations here. Like many sklearn classifiers, XCast.eMultivariateLogisticRegression expects a one-hot-encoded categorical target; for example, for a time series [AN, BN, NN, NN, BN, AN], xcast would be looking for [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]].

You can use either Y = xc.NormalTerciles().fit(Y).transform(Y), which uses a normal distribution to classify data into the tercile categories [0, 0.33], [0.34, 0.66], [0.67, 1], or xc.RankedTerciles().fit(Y).transform(Y) for a ranked approach that skips the normality assumption.

eMultivariateLogisticRegression is a simple estimator rather than a prepared MME class; you should use pMultivariateELR instead if you want all of this taken care of for you.
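For readers unfamiliar with the encoding, here is a minimal ranked-tercile one-hot sketch in numpy (purely illustrative; XCast's RankedTerciles may differ in detail, and the sample values are made up):

```python
import numpy as np

def one_hot_terciles(y):
    """Rank-based terciles: assign each sample to below/near/above normal
    and one-hot encode the result as columns [BN, NN, AN]."""
    low, high = np.percentile(y, [100 / 3, 200 / 3])
    category = np.digitize(y, [low, high])  # 0 = BN, 1 = NN, 2 = AN
    return np.eye(3)[category]

y = np.array([-1.5, -0.2, 0.0, 0.1, -0.3, 1.4])
encoded = one_hot_terciles(y)
# Each row sums to 1 and picks exactly one of the three categories.
assert encoded.shape == (6, 3)
assert np.all(encoded.sum(axis=1) == 1)
```

Because it thresholds on empirical percentiles rather than a fitted normal distribution, this corresponds to the "ranked" approach mentioned above.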

3/ is your predict like predict_proba here?

Yes

4/ I find it unintuitive to set a y_feature_dim, doesn't usually only X have features?

In probabilistic machine learning it is common for targets to be one-hot encoded, as described above, so Y does have features in that case.

5/ I find it unintuitive to set a x_sample_dim and y_sample_dim, what about just sample_dim to be required on both?

The initial motivation was to make it flexible to use with different NetCDF data with minimal preprocessing, but I do see your point. I would like to implement something that tries to guess common dimension names (e.g. X, LON, Lon, Longitude, and long all generally mean the same thing).
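A name-guessing helper of that sort might look like the following (a hypothetical sketch; the alias table and function are invented for illustration, not part of XCast):

```python
# Hypothetical alias table: common coordinate spellings mapped to canonical roles.
ALIASES = {
    'longitude': ['x', 'lon', 'long', 'longitude'],
    'latitude': ['y', 'lat', 'latitude'],
    'sample': ['s', 'time', 't', 'year'],
    'feature': ['m', 'variable', 'member', 'feature'],
}

def guess_dim(name):
    """Return the canonical role for a dimension name, or None if unknown."""
    lowered = name.lower()
    for role, spellings in ALIASES.items():
        if lowered in spellings:
            return role
    return None

assert guess_dim('Lon') == 'longitude'
assert guess_dim('variable') == 'feature'
assert guess_dim('banana') is None
```

Explicit keyword arguments could then remain as overrides for datasets whose names aren't in the table.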

6/ could _sample_dim also be a list and xcast stacks, does its stuff and unstacks again?

It could in theory, but I don't have plans to implement this any time soon.
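With xarray, such a convenience is already a two-liner around any fit/predict call; a sketch of what it could do internally (hedged: this is not XCast code, just the stock xarray stack/unstack round trip on a made-up array):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=('year', 'week', 'X'),
)

# Stack the requested sample dims into one 'S' dimension ...
stacked = da.stack(S=('year', 'week'))
assert stacked.sizes['S'] == 12

# ... run fit/predict on `stacked` here ...

# ... then unstack to restore the original layout.
restored = stacked.unstack('S').transpose('year', 'week', 'X')
assert dict(restored.sizes) == dict(da.sizes)
```

A wrapper accepting sample_dim as a list would just insert its estimator call between the stack and unstack steps.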

7/ Did I get the meaning of variable in the return right as category? What's ND for?

Yes, you are right about the category dimension. ND is used for "Non-Deterministic" models, where things are randomly initialized and require many parallel runs to get reproducible/meaningful results. Only some classes implement non-determinism, so the ones that don't end up with ND=1.

8/ you require N=4-5 dim arrays; it would be nice to lift that restriction and allow anything from N=1 upward, with dimensions being broadcast

This would be nice, and potentially it can be added some day, but it can also result in .fit calls that take hours upon hours.

Generally, I am currently working on re-doing the documentation and making more examples and walkthrough guides.

@aaronspring
Author

Thanks for the reply. Looking forward to an eLR example; I couldn't get it running even with one-hot encoding.

I understand 4/ and 5/ better now.

@kjhall01
Owner

I'll provide a similar example tomorrow; in the meantime, I'd suggest dropping the coordinates on your inputs that aren't dimensions.
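Dropping non-dimension coordinates is a generic xarray recipe (the toy dataset and the valid_time coordinate below are made up for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'t2m': (('year', 'X'), np.random.rand(3, 4))},
    coords={
        'year': [2000, 2001, 2002],
        'X': [0, 1, 2, 3],
        # A non-dimension coordinate of the kind that can confuse downstream code.
        'valid_time': ('year', [10, 11, 12]),
    },
)

# Keep only coordinates that are also dimensions; drop the rest.
extra = [c for c in ds.coords if c not in ds.dims]
clean = ds.drop_vars(extra)
assert set(clean.coords) == {'year', 'X'}
```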

@kjhall01
Owner

kjhall01 commented Apr 4, 2022

Hey Aaron,

Sorry for the delay on this; work is hectic. I'm going to close this issue because I'm currently re-doing all of the documentation and examples anyway.
I'll comment when I have an example doing probabilistic forecasting.

@kjhall01 kjhall01 closed this as completed Apr 4, 2022
@kjhall01
Owner

kjhall01 commented Apr 4, 2022

kjhall01.github.io/xcast
