example request: LogisticRegression s2s-ai-challenge #7
X_train
<xarray.Dataset>
Dimensions: (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
* lead_time (lead_time) int64 1 2
* year (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
* week (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
* X (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
* Y (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
t2m (lead_time, year, week, X, Y) float64 0.3526 0.9743 ... 0.8904
tp (lead_time, year, week, X, Y) float64 0.7562 0.6563 ... 0.9506
msl (lead_time, year, week, X, Y) float64 0.7783 0.03226 ... 0.361
y_train
<xarray.Dataset>
Dimensions: (lead_time: 2, year: 18, week: 53, X: 240, Y: 121)
Coordinates:
* lead_time (lead_time) int64 1 2
* year (year) int64 2000 2001 2002 2003 2004 ... 2014 2015 2016 2017
* week (week) int64 0 1 2 3 4 5 6 7 8 9 ... 44 45 46 47 48 49 50 51 52
* X (X) int64 0 1 2 3 4 5 6 7 8 ... 232 233 234 235 236 237 238 239
* Y (Y) int64 0 1 2 3 4 5 6 7 8 ... 113 114 115 116 117 118 119 120
Data variables:
t2m (lead_time, year, week, X, Y) float64 1.0 2.0 1.0 ... 0.0 0.0 2.0
tp (lead_time, year, week, X, Y) float64 2.0 1.0 1.0 ... 1.0 1.0 2.0
msl (lead_time, year, week, X, Y) float64 2.0 0.0 0.0 ... 1.0 1.0 1.0
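
(For reference, the two datasets printed above can be approximated with random mock data so the snippets below can be run end-to-end; this is only a sketch matching the printed shapes and dtypes, not the actual challenge data:)

import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
dims = ("lead_time", "year", "week", "X", "Y")
coords = {
    "lead_time": [1, 2],
    "year": np.arange(2000, 2018),
    "week": np.arange(53),
    "X": np.arange(240),
    "Y": np.arange(121),
}
# note: at full size each variable is roughly 440 MB of float64;
# shrink the coords above for a quick test
shape = tuple(len(v) for v in coords.values())

# predictors: continuous values, as in the X_train printout above
X_train = xr.Dataset(
    {v: (dims, rng.random(shape)) for v in ("t2m", "tp", "msl")},
    coords=coords,
)
# targets: tercile category labels 0/1/2 (not yet one-hot encoded),
# as in the y_train printout above
y_train = xr.Dataset(
    {v: (dims, rng.integers(0, 3, shape).astype(float)) for v in ("t2m", "tp", "msl")},
    coords=coords,
)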
emlr = xc.eMultivariateLogisticRegression()
emlr.fit(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
y_train.isel(lead_time=0)[['t2m']].to_array().stack(S=('year','week')),
x_feature_dim='variable', x_sample_dim='S',
y_sample_dim='S', y_feature_dim='variable')
emlr.predict(X_train.stack(S=('year','week')).isel(lead_time=0).to_array(),
x_feature_dim='variable', x_sample_dim='S').rename({'variable':'category'})#.unstack()
<xarray.DataArray (Y: 121, X: 240, ND: 1, S: 954, category: 3)>
array([[[[[0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],
          [0.33, 0.34, 0.33],
          ...

My first usage observations:
Hi Aaron,

1/ "If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?" I don't know how to do it off the top of my head, but I would do something akin to np.hstack along the X-axis if you want to do multiple lead times at a time. That way, instead of (X, Y, M, S, L) you have (X*L, Y, M, S), and you can un-stack again afterwards to get back the L dimension, like you seem to have done with the variables of the dataset above.

2/ "probably did something wrong, 1/3 results everywhere contrasting" It looks like you haven't "one-hot-encoded" your observations here. Like many sklearn classifiers, XCast's eMultivariateLogisticRegression accepts a categorical target vector: for example, for a time series [AN, BN, NN, NN, BN, AN], xcast would be looking for [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]. You can use either Y = xc.NormalTerciles().fit(Y).transform(Y), which uses a normal distribution to classify data into the tercile categories according to [0, 0.33], [0.34, 0.66], [0.67, 1], or xc.RankedTerciles().fit(Y).transform(Y) for a ranked approach that skips the normality assumption. eMultivariateLogisticRegression is a simple estimator, rather than a prepared MME class; you should use pMultivariateELR instead if you want all of this to be taken care of for you.

3/ "is your predict like predict_proba here?" Yes.

4/ "I find it unintuitive to set a y_feature_dim, doesn't usually only X have features?" In probabilistic machine learning it is common for targets to be 'one-hot-encoded' as described above, so Y does have features in that case.

5/ "I find it unintuitive to set an x_sample_dim and y_sample_dim, what about just requiring a sample_dim on both?" The initial motivation was to make it flexible to use with different NetCDF data with minimal preprocessing, but I do see your point. I would like to implement something that tries to guess common dimension names (i.e. X, LON, Lon, Longitude, long all generally mean the same thing).

6/ "could _sample_dim also be a list, so that xcast stacks, does its stuff and unstacks again?"

7/ "Did I get the meaning of variable in the return right as category? What's ND for?"

8/ "you require N=4-5 dim arrays. It would be nice to lift that restriction, allowing anything from N=1 up to higher N, with dimensions being broadcast." This would be nice, and potentially some day it can be added, but it can also result in .fit's that take hours upon hours.

Generally, I am currently working on re-doing the documentation and making more examples and walkthrough guides.
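
(A minimal sketch of the workflow described in the reply above, using only the calls named in this thread; the exact xcast signatures, and the name of the category dimension produced by the tercile transform, are assumptions on my part and may differ from the current API:)

import xcast as xc

# stack year/week into a single sample dimension, as in the snippet further up
X = X_train.stack(S=('year', 'week')).isel(lead_time=0).to_array()
y = y_train.isel(lead_time=0)[['t2m']].to_array().stack(S=('year', 'week'))

# one-hot encode the t2m observations into tercile categories (BN/NN/AN)
# with the ranked approach; xc.NormalTerciles() is the alternative that
# assumes normality
y_ohc = xc.RankedTerciles().fit(y).transform(y)

emlr = xc.eMultivariateLogisticRegression()
emlr.fit(X, y_ohc,
         x_feature_dim='variable', x_sample_dim='S',
         y_sample_dim='S', y_feature_dim='M')  # 'M' = assumed name of the category dim
probs = emlr.predict(X, x_feature_dim='variable', x_sample_dim='S')

Per the reply above, xc.pMultivariateELR is the prepared MME class that is supposed to take care of this encoding internally, so that may be the easier route.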
Thanks for the reply. Looking forward to an eLR example; I couldn't get it running even with one-hot-encoding. I understand 4/ and 5/ better now.
I'll provide a similar example tomorrow; in the meantime, I'd suggest dropping the coordinates on your inputs that aren't dimensions.
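
(In xarray terms, that suggestion could look roughly like this; just a sketch, assuming X_train and y_train as above:)

# drop coordinates that are not also dimensions (e.g. leftover scalar or
# auxiliary coords), which can confuse downstream broadcasting
non_dim_coords = [c for c in X_train.coords if c not in X_train.dims]
X_train = X_train.drop_vars(non_dim_coords)
y_train = y_train.drop_vars([c for c in y_train.coords if c not in y_train.dims])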
Hey Aaron, sorry for the delay on this; work is hectic. I'm going to close this issue because I'm currently re-doing all of the documentation and examples anyway.
kjhall01.github.io/xcast |
I love the idea of xcast. If I want to predict multiple lead_times in a single iteration, would I just stack lead_time and X or Y?

I think more concrete examples in the documentation would help users, i.e. something like applying xcast to the s2s-ai-challenge mock data as I did in https://gist.github.com/aaronspring/36e112e992e36fba935f73404dbbd3cd and the related issue (likely your approach is much more performant, because vectorize=True essentially loops the xr.apply_ufunc call, but it would be great to check).
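
(Regarding predicting multiple lead_times at once: following the hstack-style idea from the reply above, one plain-xarray sketch would be to fold lead_time into the X dimension before fitting and unstack it again afterwards; whether xcast accepts a stacked MultiIndex dimension here is an assumption on my part:)

# fold lead_time into X so a single fit/predict covers both lead times
X_all = X_train.stack(S=('year', 'week'), XL=('X', 'lead_time')).to_array()
y_all = y_train[['t2m']].stack(S=('year', 'week'), XL=('X', 'lead_time')).to_array()

# ... one-hot encode y_all and fit/predict as in the sketch further up,
# treating 'XL' as the spatial X dimension ...

# afterwards, unstack to recover separate X and lead_time dimensions:
# probs = probs.unstack('XL')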