New CDF matching implementation #259

s-scherrer · 2022-01-09T16:50:37Z

This is a major rewrite of the CDF matching implementation currently used in pytesmo. It implements a few bugfixes and performance improvements (see #255) and provides a new interface that separates the calculation of the percentile values from the scaling step.
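To illustrate what separating the two steps means, here is a minimal sketch using only NumPy. The function names `fit_percentiles` and `scale` and the `nbins` parameter are hypothetical and are not the interface added in this PR; the sketch assumes plain linear interpolation between percentile values and no special handling of the edges.

```python
import numpy as np


def fit_percentiles(src, ref, nbins=100):
    """Step 1: compute matching percentile values for source and reference.

    Hypothetical helper, not the pytesmo API.
    """
    percentiles = np.linspace(0, 100, nbins + 1)
    src_perc = np.nanpercentile(src, percentiles)
    ref_perc = np.nanpercentile(ref, percentiles)
    return src_perc, ref_perc


def scale(values, src_perc, ref_perc):
    """Step 2: map values from the source CDF onto the reference CDF.

    Uses linear interpolation between the fitted percentile values.
    """
    return np.interp(values, src_perc, ref_perc)


rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, 5000)
ref = rng.normal(10.0, 3.0, 5000)

# The percentiles are fitted once and can then be reused for any new data.
src_perc, ref_perc = fit_percentiles(src, ref)
scaled = scale(src, src_perc, ref_perc)
```

Because the fitted percentile values are plain arrays, they could be stored and applied later to data that was not part of the fit.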

It replaces the deprecated method scaling.cdf_match with a method whose functionality and interface are similar to those of scaling.cdf_beta_match, the currently recommended method, and thereby removes the old implementation of scaling.cdf_match.

I assume it would be best to also remove the deprecated method lin_cdf_match, but it's currently still in use in the validation framework.

@wpreimes, @pstradio, what would still be necessary to replace the lin_cdf_match method in the validation framework with the new implementation? I think it would be best if we only provide a single method for CDF matching.

Also, @pstradio, are you missing any other features that would prevent you from using this implementation in pytesmo?

src/pytesmo/scaling.py

pstradio · 2022-01-10T08:32:46Z

Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.

I don't see that any features are missing (at least for using this in CCI). As Wolfi mentioned in #255, a future addition could be to also add dynamic (seasonal, doy, etc.) methods, but I would leave it to a separate PR.

Does this answer the questions?

s-scherrer · 2022-01-10T09:16:14Z

> Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.

Would it then be possible to remove the lin_cdf_match method and replace it with the new implementation in the validation framework?

pstradio · 2022-01-10T12:42:25Z

> > Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.
>
> Would it then be possible to remove the lin_cdf_match method and replace it with the new implementation in the validation framework?

Sure. Do you want me to do this in the present PR?

s-scherrer · 2022-01-10T12:44:13Z

No, I can do it.

wpreimes · 2022-01-10T13:00:23Z

Just to be sure: qa4sm uses the cdf_beta_match method https://github.com/awst-austria/qa4sm/blob/master/validator/models/validation_run.py#L25 --> when changing the name in pytesmo we have to change it in qa4sm as well. I think it would be good to only provide a single cdf matching method (or two, maybe one with linear and one with non-linear interpolation? I remember that there were some issues with the non-linear method especially at the edges, but maybe that is fixed now?)
I thought about changing the default method some time ago, because of the deprecation warnings in pytesmo, but as the cdf_beta_match method changed the scaling results (slightly) in my tests, I thought someone with a better overview should decide :-)
tldr: overall I'm fine with changing it

s-scherrer · 2022-01-10T13:03:58Z

The implementation I am using at the moment only allows linear interpolation, but this can easily be changed by providing an additional setting for the interpolation spline order.

s-scherrer · 2022-02-01T16:59:51Z

I ran into a few more issues:

Edge scaling with unequal amounts of data in the source and reference bins

In the edge scaling, the data from the lowest and highest bin are selected for both source (x) and reference (y). If multiple values lie directly on the bin edges (which happens if the data only takes a few discrete levels), the number of data points in a bin might differ between x and y.

In the current implementation the first/last n values of both bins are taken, where n is the minimum of both bin sizes. This happens before the data is sorted.

pytesmo/src/pytesmo/utils.py

Lines 345 to 348 in 8dacef7

    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    x, y = np.sort(x), np.sort(y)
    slope, res, rank, s = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

In the new implementation, the data are sorted beforehand. This leads to a low/high bias compared to the "random" selection in the current implementation, because sorting first means we always select the n lowest/highest values of each bin. In one test case this even led to a very flat upper CDF edge, because there were only a few values in the upper x bin, meaning that only a few outliers in y were selected for the linear regression.
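A small synthetic example of this effect, as a sketch only: the bin contents below are made-up uniform random numbers, not data from the test case mentioned above. It compares truncating before sorting (current behaviour) with sorting before truncating.

```python
import numpy as np

rng = np.random.default_rng(42)
x_bin = rng.uniform(0.0, 1.0, 5)    # only a few values in the source bin
y_bin = rng.uniform(0.0, 1.0, 200)  # many values in the reference bin
n = min(len(x_bin), len(y_bin))

# Truncate first, then sort: an arbitrary subset of the y values is kept.
y_truncate_first = np.sort(y_bin[:n])

# Sort first, then truncate: only the n lowest y values are kept.
y_sort_first = np.sort(y_bin)[:n]

print("truncate, then sort:", y_truncate_first)
print("sort, then truncate:", y_sort_first)
```

The second selection can never have a larger mean than the first, since it consists of the n smallest values of the bin, so a regression on it only sees the extreme end of y.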

~~To overcome these inconsistencies I decided that it might be best to set n = number of data in lowest/highest y-bin, because the edge scaling should scale the edges of the y-CDF.~~

An even better solution is probably to resample the x data according to its empirical CDF and then perform the linear regression. This is implemented in the new version.
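Roughly, the resampling idea can be sketched like this. The helper names `resample_ecdf` and `edge_slope` are hypothetical, and the sketch assumes that x is resampled to the number of values in the y bin and that the regression goes through the origin, as in the snippet quoted above; the actual code in this PR may differ in the details.

```python
import numpy as np


def resample_ecdf(x, n_new):
    """Resample x to n_new values that follow its empirical CDF.

    Hypothetical helper: interpolates linearly between the sorted values.
    """
    x_sorted = np.sort(x)
    cdf_old = np.linspace(0.0, 1.0, len(x_sorted))
    cdf_new = np.linspace(0.0, 1.0, n_new)
    return np.interp(cdf_new, cdf_old, x_sorted)


def edge_slope(x_bin, y_bin):
    """Slope of a least-squares line through the origin for one edge bin."""
    y_sorted = np.sort(y_bin)
    # Resample x so that both arrays have the same length; no data is dropped.
    x_resampled = resample_ecdf(x_bin, len(y_sorted))
    slope, _, _, _ = np.linalg.lstsq(
        x_resampled.reshape(-1, 1), y_sorted, rcond=None
    )
    return slope[0]


x_bin = np.array([0.1, 0.2, 0.3])
y_bin = 2 * np.array([0.3, 0.1, 0.2, 0.15, 0.25])
print(edge_slope(x_bin, y_bin))
```

With this approach all values of both bins contribute to the regression, independent of how many values each bin contains.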

Beta-matching

The old implementation fitted the percentiles for x to a beta distribution in case of non-unique values. The non-unique percentiles in y were removed with an interpolation approach.

In one of the test cases this led to deviations between the original CDF and the CDF of the matched data (green vs. blue line below). This did not happen in the lin_cdf_match implementation without the beta matching (orange line).

I therefore changed the code to also use the interpolation for non-unique values in the x-percentiles.
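As an illustration of what interpolating non-unique percentile values can look like, here is a minimal sketch. The function `interpolate_nonunique` is hypothetical and not the helper used in pytesmo; it assumes the percentile values are sorted in non-decreasing order, and anchoring the largest value at its last occurrence is my own choice to spread out ties at the upper end.

```python
import numpy as np


def interpolate_nonunique(perc_values, percentiles):
    """Replace repeated percentile values by linearly interpolated ones.

    Hypothetical helper; assumes perc_values is non-decreasing.
    """
    perc_values = np.asarray(perc_values, dtype=float)
    percentiles = np.asarray(percentiles, dtype=float)
    uniq, first_idx = np.unique(perc_values, return_index=True)
    if len(uniq) == len(perc_values):
        return perc_values  # nothing to do, all values are unique
    # Anchor each unique value at its first occurrence, except the largest
    # one, which is anchored at the last percentile.
    anchor_idx = first_idx.copy()
    anchor_idx[-1] = len(perc_values) - 1
    return np.interp(percentiles, percentiles[anchor_idx], uniq)


percentiles = [0, 25, 50, 75, 100]
print(interpolate_nonunique([1, 1, 2, 3, 3], percentiles))
```

If all percentile values are identical, there is nothing to interpolate between and the values are returned unchanged by this sketch.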


pstradio · 2023-03-21T16:09:59Z

@s-scherrer are you planning to merge this one?

s-scherrer and others added 4 commits December 14, 2021 12:32

bugfix

047d052

started work on new cdf matching implementation

52034d3

use new CDF matching implementation

128fdb9

added scikit-learn to environment

0da1c4e

pstradio reviewed Jan 10, 2022


s-scherrer added 9 commits February 1, 2022 18:06

removed beta matching from new CDF matching implementation

d5eb242

adapted tests for new implementation

1713f8c

Merge branch 'master' into new_cdf_matching

ac78a9f

changed yapf settings

411fb04

Merge branch 'new_cdf_matching' of github.com:s-scherrer/pytesmo into…

b69aeac

… new_cdf_matching

updated docs

73fae9e

removed lin_cdf_match and associated code

11c2c68

added a few tests for coverage

7fd6ab5

fixed tests

b367557

s-scherrer marked this pull request as ready for review February 2, 2022 19:00

s-scherrer added 3 commits February 3, 2022 21:54

some smaller fixes

56638e0

catching all-nan input

c991857

handle nan inputs for prediction

37e59b9

s-scherrer closed this Nov 16, 2022

s-scherrer reopened this Nov 16, 2022

s-scherrer closed this Mar 28, 2023

s-scherrer reopened this Mar 28, 2023

s-scherrer added 2 commits March 28, 2023 14:30

Merge branch 'master' into new_cdf_matching

b2d3d59

updated environment

bfee33d

s-scherrer merged commit 98e5421 into TUW-GEO:master Mar 28, 2023
