Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linear regression scaling with nans #137

Merged
merged 3 commits into from
Jul 3, 2018
Merged

Conversation

wpreimes
Copy link
Member

@wpreimes wpreimes commented Jun 4, 2018

In this implementation, when there are nans in the src or ref series, these lines are removed from the data frame (pandas dropna) (as stats.linreg needs arrays of equal lengths), when the correction params are found (slope, inter) they are applied back to the full array (until now stats.linreg raised an error). Bias correction will probably be worse if there are a lot if nans, but the function should not fail anymore.
@cpaulik : You think this is ok?

@@ -172,8 +172,10 @@ def linreg(src, ref):
dataset scaled using linear regression
"""

df = pd.DataFrame(data={'src': src, 'ref': ref}).dropna()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the linreg and similar functions as pure functions that do not care about the input a lot. Any user can do removing of nan values outside of this function. A general solution like this is problematic since there can always be custom no data values like e.g. -99999 that we can not remove automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. makes sense, then we could - as you already suggested - try to split it up in 2 functions:

  • linreg_params(src, ref) -> returns slope and inter correction values (for a common subset of the data that I want to scale, must be done beforehand, will fail if there are NaNs )
  • linreg_adjust(src, slope, inter) -> applies slope, inter on (all, passed) values, returns bias-corrected values.

Then one could use them separately if src and ref don't match to not lose values before calling the scaling function (and having not all values scaled). And linreg(src, ref) would combine them (as already by now, but would still fail with NaNs in src and ref (or of course if they have different lengths)). But seems quite complicated, I would just use scipy then from the start :-)
But we could also identify NaNs in the higher-level scale() function (eg. add another parameter), that would handle all that when calling linreg() or mean_std() then in scale(). Can't say much about the CDF matching though...
Don't know if this makes sense to you, I just found the mean_std and linreg scaling from pytesmo not that useful for my applications until now for the described reasons (I basically never have continuous/matching TS).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can split it into two functions.

Handling NaN values in a general way is too difficult, I would let the user do that.

For getting matching time series I would just use the temp_matching module which will drop NaN as defined by pandas.

@wpreimes
Copy link
Member Author

wpreimes commented Jun 19, 2018

i split the linreg function to fix #135 as discussed in the issue

@cpaulik
Copy link
Collaborator

cpaulik commented Jun 28, 2018

That looks good. Could you update the CHANGES.rst? Then we can merge it. Just add a new line with unreleased on top of the changelog file.

@cpaulik cpaulik merged commit 0267d90 into TUW-GEO:master Jul 3, 2018
@wpreimes wpreimes deleted the nan_linreg branch July 7, 2018 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants