Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EconomicRatio PostProcessor Enhancements #1763

Merged
merged 25 commits into from
Feb 9, 2022

Conversation

dgarrett622
Copy link
Collaborator

@dgarrett622 dgarrett622 commented Feb 3, 2022


Pull Request Description

What issue does this change request address? (Use "#" before the issue to link it, i.e., #42.)

#1752

What are the significant changes in functionality due to this change request?
  • Metrics from BasicStatistics are requestable using EconomicRatio alone
  • Standard error for percentile (in BasicStatistics) implemented including addition of percent to percentile standard error variable name for clarity - e.g., percentile_10_ste_x
  • Standard error for value at risk (in EconomicRatio) implemented
  • Added threshold to variable names valueAtRisk (including standard error), expectedShortfall, gainLossRatio, and sortinoRatio for clarity (EconomicRatio) - e.g., VaR_0.05_NPV, VaR_0.05_ste_NPV, es_0.10_NPV, gainLoss_zero_NPV, sort_median_NPV

For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.) - See on the wiki for details.
  • 4. Automated Tests should pass, including run_tests, pylint, manual building and xsd tests. If there are changes to Simulation.py or JobHandler.py the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added setting, in XML block, the node <internalParallel> to True.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed/added is the the analytic documentation updated/added?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.

@dgarrett622
Copy link
Collaborator Author

I think tests will need to be modified or added for BasicStatistics, but I don't know which ones. I did not find tests for EconomicRatio, so suggestions on good tests to write are appreciated.

Copy link
Collaborator

@PaulTalbot-INL PaulTalbot-INL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Nice implementation of the percentile STE and callback to BasicStatistics.

A couple comments and one change suggestion to review.

framework/Models/PostProcessors/BasicStatistics.py Outdated Show resolved Hide resolved
framework/Models/PostProcessors/EconomicRatio.py Outdated Show resolved Hide resolved
@PaulTalbot-INL
Copy link
Collaborator

I think tests will need to be modified

You can find the BasicStatistics with standard error tests in raven/tests/framework/PostProcessors/BasicStatistics, see for example ste_mc.xml. You should be able to add percentile STE to that test, probably.

For EconomicRatio, I sent an email seeing if anyone knows where one is; it might need to be created new.

@PaulTalbot-INL
Copy link
Collaborator

Code changes are good, we just need a test to cover EconomicRatio.

I note that several tests using BasicStatistics are failing due to the added percentile STE column, so we'll have to regold those after checking they're the same otherwise.

@PaulTalbot-INL
Copy link
Collaborator

Notification to plugin developers should be sent noting that use of the BasicStatistics is producing more information than it used to, which may fail some tests using the output from that postprocessor.

@dgarrett622
Copy link
Collaborator Author

@wangcj05 do we want to send an email to the user group about the new percentile standard error?

@PaulTalbot-INL PaulTalbot-INL merged commit a8f779f into idaholab:devel Feb 9, 2022
@wangcj05 wangcj05 added the RAVENv2.2 for RAVENv2.2 Release label Feb 9, 2022
@wangcj05
Copy link
Collaborator

wangcj05 commented Feb 9, 2022

@dgarrett622 Email to the user group is optional, we are required to send the email when there is a defect in the code that will cause incorrect results.

Comment on lines +1011 to +1026
kde = stats.gaussian_kde(group.values, weights=targWeight)
val = calculatedPercentiles[target].sel(**{'percent': pct, self.pivotParameter: label}).values
subPercentileSte.append(factor/kde(val)[0])
percentileSte.append(subPercentileSte)
da = xr.DataArray(percentileSte, dims=('percent', self.pivotParameter), coords={'percent': percent, self.pivotParameter: self.pivotValue})
percentileSteSet[target] = da
else:
calcPercentiles = calculatedPercentiles[target]
if targDa.values.min() == targDa.values.max():
# distribution is a delta function, so no KDE construction
percentileSte = list(np.zeros(calcPercentiles.shape))
else:
# get KDE
kde = stats.gaussian_kde(targDa.values, weights=targWeight)
factor = np.sqrt(np.array(percent)*(1.0 - np.array(percent))/en)
percentileSte = list(factor/kde(calcPercentiles.values))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgarrett622 I have a question regarding the calculation of standard error for percentile, from the reference (see the following snapshot)
image

The standard normal distribution is used instead of KDE of real data, could you double check with the reference, and let me know what do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangcj05 That reference gives an equation for the asymptotic standard error estimate of a percentile assuming the random variable belongs to a normal distribution. There is a more general form of this asymptotic estimate that can be found in various references such as:

https://www.medicine.mcgill.ca/epidemiology/hanley/bios601/DescriptiveStatistics/Var(percentile).pdf

Here is a snapshot of the more general asymptotic estimate from the reference above which I implemented:
percentile_ste

The KDE tries to get the real distribution from the data without making the assumption that the data are normally distributed. The percentile, its standard error, and the KDE perform better with more samples, but that was always going to be a limitation when requesting a percentile.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgarrett622 Thanks for the reference, as you mentioned and also concluded by the paper, the limitation is the number of samples. I would like this information to be included in the user manual.
image

I do have some question about your implementation. In the code, you are using gaussian_kde which is based on gaussian kernels. In this case, it seems the two equations from the different papers are equivalent. Basically, you can transform your fitted PDF to standard normal distribution, and you will get the same equation. If this is the case, I prefer to use the standard normal distribution to compute the error. The reason for that is: you do not need to fit your data with gaussian_kde model for target variable. Especially when you have a larger sample size, the fitting will take more time to process.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly then that you should only use the generic formula when you have a great many samples, which will take forever due to creating the kernel?

I do agree that generally creating the kernel is costly and we've often found ways to work around it in RAVEN for that reason. I did not check the time impact of the percentile_ste addition; we really need some timing tests in RAVEN.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangcj05 The two equations are not equivalent. Using a Gaussian kernel is not the same as fitting a Gaussian to the data. The Gaussian kernel is like smoothing or interpolating to fit a distribution to the real data. Here is an example of using a Gaussian kernel to find a distribution that is definitely not Gaussian:

https://gsalvatovallverdu.gitlab.io/python/kernel_density_estimation/

With the distribution fit to the data becoming:
gaussian_kde

The number of samples is an issue here, both in accuracy and computational cost. Determining the percentile itself requires many samples, especially if the requested percentile is in the tail of the distribution. The more samples you take, the more time it takes to compute the percentile since we do sorting on the data itself. The limitation in the paper applies to the calculation of the percentile itself and the standard error calculation, whether the standard normal formulation or the KDE method is used since it is based on number of samples. However, the standard normal formulation assumes that the random variable is distributed normally while the KDE method makes no assumption about the distribution.

An alternate method would be to use bootstrapped samples, but again there are concerns about computational time with that method.

@dgarrett622 dgarrett622 deleted the EconomicRatio_enhance branch April 19, 2022 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RAVENv2.2 for RAVENv2.2 Release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TASK] Request metrics/statistics from BasicStatistics when using EconomicRatio
3 participants