Linear regression scaling with nans #137

wpreimes · 2018-06-04T09:58:30Z

In this implementation, when there are nans in the src or ref series, these lines are removed from the data frame (pandas dropna) (as stats.linreg needs arrays of equal lengths), when the correction params are found (slope, inter) they are applied back to the full array (until now stats.linreg raised an error). Bias correction will probably be worse if there are a lot if nans, but the function should not fail anymore.
@cpaulik : You think this is ok?

cpaulik · 2018-06-14T11:34:29Z

pytesmo/scaling.py

@@ -172,8 +172,10 @@ def linreg(src, ref):
 dataset scaled using linear regression
 """

+ df = pd.DataFrame(data={'src': src, 'ref': ref}).dropna()


I see the linreg and similar functions as pure functions that do not care about the input a lot. Any user can do removing of nan values outside of this function. A general solution like this is problematic since there can always be custom no data values like e.g. -99999 that we can not remove automatically.

ok. makes sense, then we could - as you already suggested - try to split it up in 2 functions:

linreg_params(src, ref) -> returns slope and inter correction values (for a common subset of the data that I want to scale, must be done beforehand, will fail if there are NaNs )

linreg_adjust(src, slope, inter) -> applies slope, inter on (all, passed) values, returns bias-corrected values.

Then one could use them separately if src and ref don't match to not lose values before calling the scaling function (and having not all values scaled). And linreg(src, ref) would combine them (as already by now, but would still fail with NaNs in src and ref (or of course if they have different lengths)). But seems quite complicated, I would just use scipy then from the start :-)
But we could also identify NaNs in the higher-level scale() function (eg. add another parameter), that would handle all that when calling linreg() or mean_std() then in scale(). Can't say much about the CDF matching though...
Don't know if this makes sense to you, I just found the mean_std and linreg scaling from pytesmo not that useful for my applications until now for the described reasons (I basically never have continuous/matching TS).

Yes we can split it into two functions.

Handling NaN values in a general way is too difficult, I would let the user do that.

For getting matching time series I would just use the temp_matching module which will drop NaN as defined by pandas.

…rection parameters

wpreimes · 2018-06-19T14:20:20Z

i split the linreg function to fix #135 as discussed in the issue

cpaulik · 2018-06-28T15:44:36Z

That looks good. Could you update the CHANGES.rst? Then we can merge it. Just add a new line with unreleased on top of the changelog file.

add linreg scaling with nans, add test

a7d4468

cpaulik reviewed Jun 14, 2018

View reviewed changes

split linreg scaling into 2 function for calculating and applying cor…

12aa087

…rection parameters

Update CHANGES.rst

42010c8

cpaulik merged commit 0267d90 into TUW-GEO:master Jul 3, 2018

wpreimes deleted the nan_linreg branch July 7, 2018 12:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linear regression scaling with nans #137

Linear regression scaling with nans #137

wpreimes commented Jun 4, 2018

cpaulik Jun 14, 2018

wpreimes Jun 14, 2018

cpaulik Jun 14, 2018

wpreimes commented Jun 19, 2018 •

edited

Loading

cpaulik commented Jun 28, 2018

Linear regression scaling with nans #137

Linear regression scaling with nans #137

Conversation

wpreimes commented Jun 4, 2018

cpaulik Jun 14, 2018

Choose a reason for hiding this comment

wpreimes Jun 14, 2018

Choose a reason for hiding this comment

cpaulik Jun 14, 2018

Choose a reason for hiding this comment

wpreimes commented Jun 19, 2018 • edited Loading

cpaulik commented Jun 28, 2018

wpreimes commented Jun 19, 2018 •

edited

Loading