-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New CDF matching implementation #259
Conversation
Regarding the I don't see that any features are missing (at least for using this in CCI). As Wolfi mentioned in #255, a future addition could be to also add dynamic (seasonal, doy, etc.) methods, but I would leave it to a separate PR. Does this answer the questions? |
Would it then be possible to remove the |
Sure. Do you want me to do this in the present PR? |
No, I can do it. |
Just to be sure: qa4sm uses the |
The implementation I am using at the moment only allows linear interpolation, but this can easily be changed by providing an additional setting for the interpolation spline order. |
I ran into a few more issues: Edge scaling with unequal amounts of data in the source and reference binsIn the edge scaling, the data from the lowest and highest bin for both source (x) and reference (y) are selected. If there are multiple values directly on the bin edges (happens if the data is only on a few discrete levels), the number of data in a bin might differ for x and y. In the current implementation the first/last n values of both bins are taken, where n is the minimum of both bin sizes. This happens before the data is sorted. Lines 345 to 348 in 8dacef7
In the new implementation, the data are sorted beforehand. This leads to a low/high bias compared to the "random" selection in the current implementation, because this way we make sure to select the n lowest/highest values of each bin. In one test case this even lead to a very flat upper CDF edge, because there were only a few values in the upper x bin, meaning that only a few outliers in y were selected for the linear regression.
The even better solution would probably resampling of the x data according to its empirical CDF and then perform the linear regression. This is implemented in the new version. Beta-matchingThe old implementation fitted the percentiles for x to a beta distribution in case of non-unique values. The non-unique percentiles in y where removed with an interpolation approach. In one of the test cases this lead to deviations between the original CDF and the CDF of the matched data (green vs. blue line below). This did not happen in the lin_cdf_match implementation without the beta matching (orange line). I therefore changed the code to also use the interpolation for non-unique values in the x-percentiles. |
… new_cdf_matching
@s-scherrer are you planning to merge this one? |
This is a major rewrite of the CDF matching implementation currently used in pytesmo. It implements a few bugfixes and performance improvements (see #255) and provides a new interface that separates the calculation of the percentile values from the scaling step.
It replaces the deprecated method
scaling.cdf_match
with a method that has a similar functionality and interface asscaling.cdf_beta_match
, the currently recommended method (and removes thereby the old implementation ofscaling.cdf_match
).I assume it would be best to also remove the deprecated method
lin_cdf_match
, but it's currently still in use in the validation framework.@wpreimes, @pstradio, what would still be necessary to replace the lin_cdf_match method in the validation framework with the new implementation? I think it would be best if we only provide a single method for CDF matching.
Also, @pstradio, do you miss any other features that prevents you from using this implementation in pytesmo?