
Confidence intervals for PAWN #596

halla opened this issue Oct 20, 2023 · 7 comments

@halla

halla commented Oct 20, 2023

Hi all,

I'm trying to wrap my head around this PAWN method. There seems to be no built-in confidence interval calculation, so I followed the RBDFast example in #373:

# %%
import numpy as np
import pandas as pd
from scipy.stats import norm

from SALib.analyze import pawn
from SALib.sample import latin
from SALib.test_functions import Ishigami


problem = {
  'num_vars': 3,
  'names': ['x1', 'x2', 'x3'],
  'bounds': [[-np.pi, np.pi]]*3
}

Xs = latin.sample(problem, 1000)
Y = Ishigami.evaluate(Xs)

def bootstrap(Xs, Y, num_resamples=100):
    num_results = len(Y)

    # use ~50% of available data for each resample, as an example
    n_size = int(num_results * 0.5)

    res = []
    for i in range(num_resamples):
        sample_idx = np.random.choice(num_results, replace=True, size=n_size)
        X_rs, Y_rs = Xs[sample_idx], Y[sample_idx]
        Si = pawn.analyze(problem, X_rs, Y_rs)
        res.append(Si.to_df())
    return res

def conf_int(vals, conf_level=0.95):
    # half-width of a normal-approximation confidence interval:
    # z-score for the confidence level times the bootstrap standard deviation
    return norm.ppf(0.5 + conf_level / 2.0) * vals.std(ddof=1)

results = bootstrap(Xs, Y)
df_si_conf = pd.concat(results).reset_index().groupby('index').agg(conf_int)
df_si_conf.columns = [f"{col}_conf" for col in df_si_conf.columns]
print(df_si_conf)

Output:

       minimum_conf  mean_conf  median_conf  maximum_conf   CV_conf
index                                                              
x1         0.040621   0.029645     0.037064      0.077842  0.092398
x2         0.067186   0.026071     0.049795      0.069738  0.076581
x3         0.026492   0.025092     0.032007      0.085758  0.131611

The above should calculate the confidence intervals for the min/mean/median/max values for each parameter, but I started to suspect this might not be the right way to do it. Mostly, I'm not comfortable with taking statistics of statistics here (sd of min/max/mean/median). Any thoughts, is this a valid way to go?

@tupui
Member

tupui commented Oct 20, 2023

Hi @halla 👋

(I think) Bootstrapping is always fine to do. Might not be the most effective thing to do as in some cases you could instead derive analytical bounds, but done properly it does give you an idea of the variation of your statistic.

I did not check your code line by line, but I would suggest using scipy.stats.bootstrap, first to avoid any mistakes and second to get access to more advanced bootstrapping techniques:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html
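For reference, a minimal sketch of what that looks like on a toy statistic (the sample mean here is an illustrative stand-in for a PAWN index, not SALib code):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)

# bootstrap takes a sequence of samples and a statistic;
# by default it returns a 95% BCa confidence interval
res = bootstrap((data,), np.mean, n_resamples=999, random_state=rng)
ci = res.confidence_interval
print(ci.low, ci.high)
```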

p.s. there are no silly questions and stats is hard for everyone 😉 You are more than welcome to ask away!

@tupui tupui added the question label Oct 20, 2023
@halla
Author

halla commented Oct 20, 2023

Hi @tupui, thanks for the reply! I'll have a look at the scipy implementation. At first look, though, it seems I'd need to modify pawn.analyze() to avoid repeating the bootstrap for each statistic. 🤔 Access to PAWN's non-aggregated results would be nice anyway, as it could be used for regional sensitivity analysis, if I'm not wrong?

@ConnectedSystems
Member

ConnectedSystems commented Oct 20, 2023

Access to PAWN's non-aggregated results would be nice anyway, as it could be used for regional sensitivity analysis, if I'm not wrong?

A form of it, yes, though it's not exactly the same. One of my many plans is to separate it out so you can do this.

On the bootstrapping, one reason why I didn't implement it for PAWN is that the approach I had in mind would require many sub-samples, and I found calculating the KS statistic very costly for any higher-dimensional model. Though maybe it's different now with the recent changes in Python.

Regarding your implementation, it looks correct to me, but do heed @tupui 's advice re adopting SciPy. The BCa approach is the one I typically prefer these days.
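For reference, the BCa approach is selected via the `method` keyword of `scipy.stats.bootstrap`; a small sketch on toy data (not PAWN output) comparing it with a plain percentile interval:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=300)  # skewed data, where BCa helps

# compare a plain percentile interval with the bias-corrected
# and accelerated (BCa) interval for the mean
for method in ("percentile", "BCa"):
    res = bootstrap((data,), np.mean, method=method,
                    n_resamples=999, random_state=rng)
    print(method, res.confidence_interval.low, res.confidence_interval.high)
```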

@halla
Author

halla commented Oct 20, 2023

Thanks @ConnectedSystems, I guess I could read up a bit on the different bootstrap approaches. Regarding performance, the API of scipy.stats.bootstrap doesn't seem ideal for this case, as it couples the calculation of the (single) 'statistic' with its CI. It doesn't seem to support parallel execution either.

@tupui
Member

tupui commented Oct 20, 2023

The bootstrap method is not very easy to use, I agree. I had trouble reviewing it 😅 There are some tutorials to help out: https://nbviewer.org/github/scipy/scipy-cookbook/blob/main/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_3.ipynb

Another example of its use is in scipy.stats.sobol_indices; see the bootstrap part of its source.

For the parallel part, there is a vectorized keyword. Of course that is only useful if you manage to write your stats function in such a way.
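A sketch of a statistic written that way: it accepts an `axis` argument and operates on a whole batch of resamples in one NumPy call (the interquartile range here is just an illustrative statistic):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(1)
data = rng.normal(size=200)


def iqr(sample, axis=-1):
    # handles a stack of resamples at once, no Python loop
    q75, q25 = np.percentile(sample, [75, 25], axis=axis)
    return q75 - q25


res = bootstrap((data,), iqr, vectorized=True, n_resamples=999,
                random_state=rng)
print(res.confidence_interval)
```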

@halla
Author

halla commented Oct 20, 2023

Thanks @tupui for the pointers, I'll look into them and give the bootstrap method another go next week. 🙂

@halla
Author

halla commented Oct 27, 2023

Ok, I spent a bit too much time trying to figure this out. I haven't found a clean way yet, but it does in fact seem possible to get both multiple outputs and parallel execution. I'm somewhat confused about the use of numpy axes and didn't have a way to properly verify that this is actually doing the right thing (besides the numbers approximately matching the ones from the previous method). Anyway, here's the code so far, for what it's worth. 🙄

# %% Scipy version
import scipy.stats

stats = ["minimum", "mean", "median", "maximum", "CV"]


def statistic(idx):
    Xs_rs, Y_rs = Xs[idx], Y[idx]
    res = pawn.analyze(problem, Xs_rs, Y_rs)
    # returning a 2-D array (statistics x parameters) gives multiple outputs
    return np.array([res.get(key) for key in stats])


# %% vectorized and parallel

from os import cpu_count
from multiprocessing import Pool
import numpy as np


def parallelize(data, func, num_of_processes=cpu_count() - 1):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = np.concatenate(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data


def pawn_batch(idxs):
    """
    For a batch of indexes.
    """
    res = []
    for idx in idxs:
        res.append(statistic(idx))
    return res


def statistic_vec(idxs: np.ndarray, axis=0):
    # `axis` is ignored here; scipy seems to always pass axis=-1,
    # while the resamples arrive along the first axis of `idxs`
    if len(idxs.shape) == 1:  # scipy also calls this with the original 1-D vector
        res = statistic(idxs)
    else:
        res = pawn_batch(idxs)
        # scipy seems to expect the resamples along the last axis of the result
        res = np.moveaxis(res, 0, -1)
    return res


def statistic_vec_parallel(idxs: np.ndarray, axis=0):
    # same as statistic_vec, but with the batch processed in parallel
    if len(idxs.shape) == 1:
        res = statistic(idxs)
    else:
        res = parallelize(idxs, pawn_batch)
        res = np.moveaxis(res, 0, -1)
    return res

n_resamples = 999
confidence_level = 0.95
res = scipy.stats.bootstrap(
    [np.arange(len(Xs))],
    statistic=statistic_vec_parallel,
    method="bca",
    n_resamples=n_resamples,
    confidence_level=confidence_level,
)


df_si_conf2 = pd.DataFrame()
for i, name in enumerate(stats):  # assumes stats are still in the same order
    df_si_conf2[f"{name}_conf_low"] = res.confidence_interval.low[i]
    df_si_conf2[f"{name}_conf_high"] = res.confidence_interval.high[i]
    df_si_conf2[f"{name}_se"] = res.standard_error[i]
    # SALib-style 'conf' metric (normal-approximation error margin)
    df_si_conf2[f"{name}_conf"] = (
        norm.ppf(0.5 + confidence_level / 2.0) * res.standard_error[i]
    )

df_si_conf2


3 participants