Confidence intervals for PAWN #596
Hi @halla 👋 (I think) Bootstrapping is always fine to do. It might not be the most efficient approach, since in some cases you could instead derive analytical bounds, but done properly it does give you an idea of the variation of your statistic. I did not check your code line by line, but I would suggest using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html p.s. there are no silly questions, and stats is hard for everyone 😉 You are more than welcome to ask away!
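A minimal sketch of what using scipy.stats.bootstrap looks like on a toy statistic (the data and statistic here are illustrative, not the PAWN-specific wiring discussed below):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

# bootstrap expects a sequence of samples; here a single 1-D sample.
# np.mean accepts an `axis` kwarg, so SciPy treats it as vectorized.
res = bootstrap(
    (sample,),
    np.mean,
    n_resamples=999,
    confidence_level=0.95,
    method="BCa",
)

ci = res.confidence_interval
print(ci.low, ci.high, res.standard_error)
```

The result bundles the confidence interval and the bootstrap standard error of the statistic in one object.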
Hi @tupui, thanks for the reply! I'll have a look at the SciPy implementation. At first glance, though, it seems I'd need to modify pawn.analyze() to avoid repeating the bootstrap for each statistic if using it. 🤔 Access to PAWN's non-aggregated results would be nice anyway, as it could be used for regional sensitivity analysis, if I'm not wrong?
A form of it, yes, though it's not exactly the same. One of my many plans is to separate it out so you can do this. On the bootstrapping: one reason I didn't implement it for PAWN is that the approach I had in mind would require many sub-samples, and I found calculating the KS statistic very costly for any higher-dimensional model. Though maybe that's different now with the recent changes in Python. Regarding your implementation, it looks correct to me, but do heed @tupui's advice re adopting SciPy. The …
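For context, the KS distance that PAWN is built on can be computed with scipy.stats.ks_2samp; a toy illustration (the arrays here are made up, standing in for unconditional and conditional model outputs):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
y = rng.normal(size=1000)                    # unconditional model output
y_cond = rng.normal(loc=0.5, size=200)       # output with one input fixed to a slice

# PAWN aggregates such KS distances over many conditioning slices per input,
# which is why the cost grows quickly with dimension and sub-sample count
ks = ks_2samp(y, y_cond).statistic
print(ks)
```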
Thanks @ConnectedSystems, I guess I could read up a bit on the different bootstrap approaches. Regarding performance, the API of scipy.stats.bootstrap doesn't seem ideal for this case, as it seems to couple the calculation of a (single) statistic with its CI. It doesn't seem to support parallel execution either.
The bootstrap method is not very easy to use, I agree. I had trouble reviewing it 😅 There are some tutorials to help out: https://nbviewer.org/github/scipy/scipy-cookbook/blob/main/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_3.ipynb Another example of use is in … For the parallel part, there is a …
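One of the trickier parts is the `vectorized` mode: a sketch (illustrative statistic, not from the thread) of a statistic written so scipy.stats.bootstrap can evaluate many resamples in a single call by reducing along the `axis` keyword:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
sample = rng.exponential(scale=3.0, size=200)

def iqr(x, axis=-1):
    # Vectorized statistic: reduces along `axis`, so SciPy can pass a
    # 2-D array of resamples (batch along the other dimension) at once.
    q75, q25 = np.percentile(x, [75, 25], axis=axis)
    return q75 - q25

res = bootstrap(
    (sample,),
    iqr,
    vectorized=True,
    n_resamples=999,
    method="percentile",
)
print(res.confidence_interval)
```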
Thanks @tupui for the pointers, I'll look into them and give the bootstrap method another go next week. 🙂
Ok, I spent a bit too much time trying to figure this out. I haven't found a clean way yet, but it does in fact seem possible to get both multiple outputs and parallelize the process. I'm somewhat confused about the use of NumPy axes and didn't have a way to properly verify that this is actually doing the right thing (besides the numbers approximately matching the ones from the previous method). Anyway, here's the code so far, for what it's worth. 🙄
# %% SciPy version
import numpy as np
import pandas as pd
import scipy.stats
from scipy.stats import norm
from os import cpu_count
from multiprocessing import Pool
from SALib.analyze import pawn

# `problem`, `Xs` and `Y` are defined earlier (problem spec, inputs, outputs)
stats = ["minimum", "mean", "median", "maximum", "CV"]

def statistic(idx):
    Xs_rs, Y_rs = Xs[idx], Y[idx]
    res = pawn.analyze(problem, Xs_rs, Y_rs)
    # multiple outputs seem to work this way: shape (n_stats, n_params)
    return np.array([res.get(key) for key in stats])

# %% vectorized and parallel
def parallelize(data, func, num_of_processes=max(cpu_count() - 1, 1)):
    data_split = np.array_split(data, num_of_processes)
    with Pool(num_of_processes) as pool:
        result = np.concatenate(pool.map(func, data_split))
    return result

def pawn_batch(idxs):
    """Evaluate the statistic for a batch of index vectors."""
    return np.array([statistic(idx) for idx in idxs])

def statistic_vec(idxs: np.ndarray, axis=0):
    # `axis` is ignored here; it seems to always be -1.
    # Resamples in the input seem to be along the first axis.
    if idxs.ndim == 1:  # some methods also call this with the original 1-D vector
        res = statistic(idxs)
    else:
        res = pawn_batch(idxs)
        # resamples in the result seem to be expected along the last axis
        res = np.moveaxis(res, 0, -1)
    return res

def statistic_vec_parallel(idxs: np.ndarray, axis=0):
    # same as statistic_vec, but parallelized across processes
    if idxs.ndim == 1:
        res = statistic(idxs)
    else:
        res = parallelize(idxs, pawn_batch)
        res = np.moveaxis(res, 0, -1)
    return res

n_resamples = 999
confidence_level = 0.95
res = scipy.stats.bootstrap(
    [np.arange(len(Xs))],
    statistic=statistic_vec_parallel,
    method="bca",
    n_resamples=n_resamples,
    confidence_level=confidence_level,
)

df_si_conf2 = pd.DataFrame()
for i, name in enumerate(stats):  # assuming stats are still in the same order
    df_si_conf2[f"{name}_conf_low"] = res.confidence_interval.low[i]
    df_si_conf2[f"{name}_conf_high"] = res.confidence_interval.high[i]
    df_si_conf2[f"{name}_se"] = res.standard_error[i]
    # SALib-style 'conf' metric (error margin, i.e. half-width of the CI)
    df_si_conf2[f"{name}_conf"] = (
        norm.ppf(0.5 + confidence_level / 2.0) * res.standard_error[i]
    )
df_si_conf2
Hi all,
I'm trying to wrap my head around this PAWN method. There seems to be no built-in confidence interval calculation, so I followed the RBDFast example in #373
Output:
The above should calculate the confidence intervals for the min/mean/median/max values for each parameter, but I've started to suspect this might not be the right way to do it. Mostly, I'm not comfortable taking statistics of statistics here (SD of min/max/mean/median). Any thoughts? Is this a valid way to go?