
Documentation Mixup #34

Closed
ThomasMGeo opened this issue Aug 12, 2022 · 7 comments

Comments

@ThomasMGeo
Contributor

Hi @kjhall01 !

I am interested in using xcast for a few projects, but still trying to get my arms around the api and generally xarray I/O with xcast. Attached is a notebook (zipped) where I ran into some issues. Happy to set up a meeting too if that is best.

-Thomas
xcast_demo.ipynb.zip

@kjhall01
Owner

kjhall01 commented Aug 13, 2022

Hey Thomas!

Glad you're interested in XCast! Hopefully I can clear up some issues for you here; if not, I'm definitely happy to set up a time to meet!

Based on your notebook, here are some points:

1/ XCast needs input xr.DataArrays to be four-dimensional, with each dimension representing one of latitude / longitude / samples / features. However, since so many different naming standards for these dimensions exist, XCast abstracts the names away by taking a mapping of {dimension_type: dimension_name, …} in its keyword arguments (i.e., in .fit or .predict).

However, in order to accommodate the most common names (X/Y/T/M/S/lat/lon/time) without requiring you to pass them explicitly every time, XCast implements a dimension-name guessing heuristic.

So for your first issue, using the xarray tutorial data, you get:

Detection Faild - Duplicated Coordinate:
LATITUDE: lat
LONGITUDE: lon
SAMPLE: time
FEATURE: time

So you can see that the dimension named "variable", which in your case maps to the feature dimension, was not successfully guessed; that's not one I had anticipated (the guessing is not the most complete feature yet). lat/lon/time WERE ones I expected to come up, which is why they are guessed by default.
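The guessing step can be pictured with a toy sketch (my own illustration, not XCast's actual code; the candidate-name lists here are assumptions):

```python
# Illustration only: a toy version of the kind of name-guessing heuristic
# described above -- not XCast's actual implementation.
GUESSES = {
    "LATITUDE": ("Y", "lat", "latitude"),
    "LONGITUDE": ("X", "lon", "longitude"),
    "SAMPLE": ("T", "S", "time"),
    "FEATURE": ("M", "features", "time"),  # "time" as a last-resort fallback
}

def guess_dims(dims):
    """Map each dimension role to the first name it recognizes in `dims`."""
    found = {}
    for role, names in GUESSES.items():
        for name in names:
            if name in dims:
                found[role] = name
                break
    return found

# With no "variable"-like name on the FEATURE list, SAMPLE and FEATURE both
# resolve to "time" -- the duplicated-coordinate failure shown above.
print(guess_dims(("lat", "lon", "time", "variable")))
```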

To constrain the guessing process, you can pass keyword arguments specifying the names of the lat/lon/sample/feature dimensions on your x and y data arrays in your calls to fit and predict.

An example for your case would be:

rf.fit(x, y, x_feature_dim='variable', y_feature_dim='variable')

and that should fix that problem. Alternatively, you could rename your dimensions to X/Y/T/M and skip the keyword arguments.
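The renaming route might look like this in plain xarray (a hypothetical 4-D array standing in for the notebook's data):

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D predictor with the dimension names from the notebook.
da = xr.DataArray(
    np.zeros((3, 4, 5, 2)),
    dims=("lat", "lon", "time", "variable"),
)

# Rename to the X/Y/T/M convention that XCast guesses without help.
renamed = da.rename({"lon": "X", "lat": "Y", "time": "T", "variable": "M"})
print(renamed.dims)  # ('Y', 'X', 'T', 'M')
```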

2/ The next issue seems to be the prediction produced by the ensemble mean. The prediction here is just the mean across the feature dimension, with maybe some preprocessing (I would have to check). Either way, it is a simple enough task that it can be done with pure xarray, so the resulting underlying data is a lazily evaluated dask array. It doesn't show up when you print(pred) because it hasn't been calculated yet; it is just queued up for dask to compute once it's needed. If you were to plot the data, access the .values attribute, or call the .load() method on pred, it would compute the average and then look more like a normal xarray DataArray. This looks like pretty normal behavior to me; maybe try plotting the data and see what happens. I'll try to recreate your notebook next week.
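A minimal demonstration of that lazy behavior, using dask directly:

```python
import dask.array as da

# Building the task graph does no arithmetic yet; printing lazy_mean
# shows graph metadata, not values.
lazy_mean = da.ones((1000, 1000)).mean(axis=0)

# The values only materialize on an explicit compute; xarray's .load(),
# .values, and plotting all trigger the same computation.
result = lazy_mean.compute()
print(result.shape)  # (1000,)
```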

Note that the pred variable is 5-D, as all XCast predictions are: in order to accommodate repeated runs of stochastic methods, XCast predictions have a fifth dimension called "ND" (short for non-determinism). For deterministic methods you can safely remove that dimension by taking the mean over it or by dropping it.
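For example, with a hypothetical prediction whose ND dimension has size one:

```python
import numpy as np
import xarray as xr

# Hypothetical 5-D XCast-style prediction; ND has size 1 for a
# deterministic method.
pred = xr.DataArray(
    np.random.rand(3, 4, 5, 2, 1),
    dims=("Y", "X", "T", "M", "ND"),
)

# Collapse ND by taking the mean; for a size-1 dimension this is
# equivalent to squeezing it out.
deterministic = pred.mean("ND")
print(deterministic.dims)  # ('Y', 'X', 'T', 'M')
```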

I hope this helps! Please follow up with any more questions. I’ll try to update the docs this week.

@kjhall01 kjhall01 reopened this Aug 13, 2022
@ThomasMGeo
Contributor Author

ThomasMGeo commented Aug 15, 2022

Hi Kyle,

Thanks for the tip! That helped, I ended up using a single quote:

rf = xc.rRandomForest()

rf.fit(x, y,
       x_feature_dim='variable',
       y_feature_dim='variable')

Still working through some other (user :)) issues, but that did solve a big one. Thanks!

@ThomasMGeo
Contributor Author

Hi @kjhall01 !

Thanks for the call today; I think I got a lot closer. Attached is the notebook; I'm getting stuck on scoring.
xcast_demo 2.ipynb.zip

@kjhall01
Owner

Hey Thomas,

It was great talking with you! I've had a look at your notebook. The hard part here is that mean squared error is not a built-in skill metric for XCast; you need to use the xcast.metric decorator to extend the scikit-learn function to operate gridpoint-wise. The idea is to apply it separately at every grid point, to get a picture of the spatial distribution of the skill, rather than stacking everything into a single dimension and ending up with one number. I'll email you a new version of the notebook directly, but am attaching a screenshot here for posterity!

Note that if you were using, say, Pearson correlation or another built-in metric, you could go directly to the XCast function: xc.Pearson(preds.mean('ND'), mltestY, ...)

Also - just taking the mean of the 'ND' dimension is appropriate here!

[Screenshot: Screen Shot 2022-08-18 at 10.44.40 PM]
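The gridpoint-wise idea can be sketched in plain xarray (illustration only: the real mechanism is the xcast.metric decorator, and the MSE below is hand-rolled rather than imported from scikit-learn):

```python
import numpy as np
import xarray as xr

def gridpoint_mse(pred, obs, sample_dim="T"):
    # Reduce over the sample dimension only, leaving one score per
    # (Y, X) gridpoint -- the same idea the xcast.metric decorator
    # applies to scikit-learn metric functions.
    return ((pred - obs) ** 2).mean(sample_dim)

# Hypothetical 3-D prediction/observation pair (features already collapsed).
pred = xr.DataArray(np.ones((3, 4, 5)), dims=("Y", "X", "T"))
obs = xr.DataArray(np.zeros((3, 4, 5)), dims=("Y", "X", "T"))

score = gridpoint_mse(pred, obs)
print(score.dims)  # ('Y', 'X') -- a spatial map of skill, not one number
```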

@ThomasMGeo
Contributor Author

This looks great! thanks!

What's the reasoning behind taking the mean of ND?

@kjhall01
Owner

For stochastically initialized ML methods like ELM / ANN / RF, it's good practice to use the mean of an ensemble of many randomly initialized runs rather than just one model. So XCast comes with the capacity to run the same ML method many times (ND stands for 'non-determinism', although I think that is actually a bit of a misnomer) and always returns predictions with one additional dimension reflecting these multiple randomly initialized model fittings.

For deterministic methods like MLR, and methods which are themselves already ensembles (like Random Forest), the ND part of XCast is unnecessary, but XCast still returns an ND dimension of size one on predictions.

So any predictions from XCast will be latitude x longitude x samples x features x ND, but the 'metric' decorator is designed for only four dimensions (latitude x longitude x samples x features) on both predictions and observations. So you need to remove the ND dimension by taking the mean.

Does that make sense?

@ThomasMGeo
Contributor Author

Ah yes, that does make sense! Thanks for explaining that.
