
Documentation Mixup #34

Closed
ThomasMGeo opened this issue Aug 12, 2022 · 7 comments

Comments

@ThomasMGeo
Contributor

Hi @kjhall01 !

I am interested in using xcast for a few projects, but still trying to get my arms around the api and generally xarray I/O with xcast. Attached is a notebook (zipped) where I ran into some issues. Happy to set up a meeting too if that is best.

-Thomas
xcast_demo.ipynb.zip

@kjhall01
Owner

kjhall01 commented Aug 13, 2022

Hey Thomas!

Glad you're interested in XCast! Hopefully I can clear up some issues for you here; if not, I'm definitely happy to set up a time to meet!

Based on your notebook, here are some points:

1/ XCast needs input xr.DataArrays to be four-dimensional, with each dimension representing one of latitude / longitude / samples / features. However, since so many different naming standards for these dimensions exist, XCast abstracts the names away by taking a mapping of {dimension_type: dimension_name, …} in its keyword arguments (i.e., in .fit or .predict).

However, in order to accommodate the most common names (X/Y/T/M/S/lat/lon/time) without requiring you to pass them explicitly every time, XCast implements a dimension-name guessing heuristic.

So for your first issue, using the xarray tutorial data, you get:

Detection Faild - Duplicated Coordinate:
LATITUDE: lat
LONGITUDE: lon
SAMPLE: time
FEATURE: time

So you can see that the dimension named "variable", which in your case maps to the feature dimension, was not successfully guessed; that's not one I had anticipated (the guessing is not the most complete feature yet). lat/lon/time WERE ones I expected to come up, which is why they are guessed by default.
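The guessing step can be pictured with a toy sketch (my own illustration, not XCast's actual code; the candidate-name lists here are assumptions):

```python
# Illustration only: a toy version of the kind of name-guessing heuristic
# described above -- not XCast's actual implementation.
GUESSES = {
    "LATITUDE": ("Y", "lat", "latitude"),
    "LONGITUDE": ("X", "lon", "longitude"),
    "SAMPLE": ("T", "S", "time"),
    "FEATURE": ("M", "features", "time"),  # "time" as a last-resort fallback
}

def guess_dims(dims):
    """Map each dimension role to the first name it recognizes in `dims`."""
    found = {}
    for role, names in GUESSES.items():
        for name in names:
            if name in dims:
                found[role] = name
                break
    return found

# With no "variable"-like name on the FEATURE list, SAMPLE and FEATURE both
# resolve to "time" -- the duplicated-coordinate failure shown above.
print(guess_dims(("lat", "lon", "time", "variable")))
```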

To constrain the guessing process, you can pass keyword arguments specifying the names of the lat/lon/sample/feature dimensions on your x and y data arrays in your calls to fit and predict.

An example for your case would be:

rf.fit(x, y, x_feature_dim='variable', y_feature_dim='variable')

and that should fix that problem. Alternatively, you could rename your dimensions to X/Y/T/M and skip the keyword arguments.
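The renaming route might look like this in plain xarray (a hypothetical 4-D array standing in for the notebook's data):

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D predictor with the dimension names from the notebook.
da = xr.DataArray(
    np.zeros((3, 4, 5, 2)),
    dims=("lat", "lon", "time", "variable"),
)

# Rename to the X/Y/T/M convention that XCast guesses without help.
renamed = da.rename({"lon": "X", "lat": "Y", "time": "T", "variable": "M"})
print(renamed.dims)  # ('Y', 'X', 'T', 'M')
```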

2/ The next issue seems to be the prediction produced by the ensemble mean. The prediction here is just the mean across the feature dimension, with maybe some preprocessing (I would have to check). Either way, it is a simple enough task that it can be done with pure xarray, so the resulting underlying data is a lazily evaluated dask array. It doesn't show up when you print(pred) because it hasn't been calculated yet; it is just queued up for dask to compute once it's needed. If you were to plot the data, access the .values attribute, or call the .load() method on pred, it would compute the average and then look more like a normal xarray DataArray. This looks like pretty normal behavior to me; maybe try plotting the data and see what happens. I'll try to recreate your notebook next week.
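A minimal demonstration of that lazy behavior, using dask directly:

```python
import dask.array as da

# Building the task graph does no arithmetic yet; printing lazy_mean
# shows graph metadata, not values.
lazy_mean = da.ones((1000, 1000)).mean(axis=0)

# The values only materialize on an explicit compute; xarray's .load(),
# .values, and plotting all trigger the same computation.
result = lazy_mean.compute()
print(result.shape)  # (1000,)
```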

Note that the pred variable is 5-D, as all XCast predictions are: in order to accommodate repeated runs of stochastic methods, XCast predictions have a fifth dimension called "ND" (short for non-determinism). For deterministic methods you can safely remove that dimension by taking the mean over it or by dropping it.
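For example, with a hypothetical prediction whose ND dimension has size one:

```python
import numpy as np
import xarray as xr

# Hypothetical 5-D XCast-style prediction; ND has size 1 for a
# deterministic method.
pred = xr.DataArray(
    np.random.rand(3, 4, 5, 2, 1),
    dims=("Y", "X", "T", "M", "ND"),
)

# Collapse ND by taking the mean; for a size-1 dimension this is
# equivalent to squeezing it out.
deterministic = pred.mean("ND")
print(deterministic.dims)  # ('Y', 'X', 'T', 'M')
```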

I hope this helps! Please follow up with any more questions. I’ll try to update the docs this week.

@kjhall01 kjhall01 reopened this Aug 13, 2022
@ThomasMGeo
Contributor Author

ThomasMGeo commented Aug 15, 2022

Hi Kyle,

Thanks for the tip! That helped, I ended up using a single quote:

rf = xc.rRandomForest()

rf.fit(x, y,
       x_feature_dim='variable',
       y_feature_dim='variable')

Still working through some other (user :)) issues, but that did solve a big one. Thanks!

@ThomasMGeo
Contributor Author

Hi @kjhall01 !

Thanks for the call today; I think I got a lot closer. Attached is the notebook; I'm getting stuck on scoring.
xcast_demo 2.ipynb.zip

@kjhall01
Owner

Hey Thomas,

It was great talking with you! I've had a look at your notebook. The hard part here is that mean squared error is not a built-in skill metric for XCast; you need to use the xcast.metric decorator to extend the scikit-learn function to operate gridpoint-wise. The idea is to apply it separately at every grid point, to get a picture of the spatial distribution of the skill, rather than stacking everything into a single dimension and ending up with one number. I'll email you a new version of the notebook directly, but am attaching a screenshot here for posterity!

Note that if you were using, say, Pearson correlation or another built-in metric, you could go directly to the XCast function: xc.Pearson(preds.mean('ND'), mltestY, ...)

Also - just taking the mean of the 'ND' dimension is appropriate here!

[Screenshot: Screen Shot 2022-08-18 at 10.44.40 PM]
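The gridpoint-wise idea can be sketched in plain xarray (illustration only: the real mechanism is the xcast.metric decorator, and the MSE below is hand-rolled rather than imported from scikit-learn):

```python
import numpy as np
import xarray as xr

def gridpoint_mse(pred, obs, sample_dim="T"):
    # Reduce over the sample dimension only, leaving one score per
    # (Y, X) gridpoint -- the same idea the xcast.metric decorator
    # applies to scikit-learn metric functions.
    return ((pred - obs) ** 2).mean(sample_dim)

# Hypothetical 3-D prediction/observation pair (features already collapsed).
pred = xr.DataArray(np.ones((3, 4, 5)), dims=("Y", "X", "T"))
obs = xr.DataArray(np.zeros((3, 4, 5)), dims=("Y", "X", "T"))

score = gridpoint_mse(pred, obs)
print(score.dims)  # ('Y', 'X') -- a spatial map of skill, not one number
```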

@ThomasMGeo
Contributor Author

This looks great! thanks!

What's the reasoning behind taking the mean of ND?

@kjhall01
Owner

For stochastically initialized ML methods like ELM / ANN / RF, it's good practice to use the mean of an ensemble of many randomly initialized runs rather than just one model. So XCast comes with the capacity to run the same ML method many times (ND stands for 'non-determinism', although I think that is actually a bit of a misnomer) and always returns predictions with one additional dimension reflecting these multiple randomly initialized model fittings.

For deterministic methods like MLR, and methods which are themselves already ensembles (like Random Forest), the ND part of XCast is unnecessary, but XCast still returns an ND dimension of size one on predictions.

So any predictions from XCast will be latitude x longitude x samples x features x ND, but the 'metric' decorator is designed for only four dimensions (latitude x longitude x samples x features) on both predictions and observations. So you need to remove the ND dimension by taking the mean.

Does that make sense?

@ThomasMGeo
Contributor Author

Ah yes, that does make sense! Thanks for explaining that.
