Handling NaN/inf model output #273

Open
jdherman opened this issue Nov 7, 2019 · 15 comments
Labels
priority An issue on which several users have requested action

Comments

@jdherman
Member

jdherman commented Nov 7, 2019

This has been maybe the most common question / annoyance for users: what to do about missing data in the output vector Y?
(See #134, #255, #262, #237, #206, and PR #235)

We need a consistent way to deal with this across all methods. I'd suggest:

  1. All analyze() methods should raise an error if NaNs are detected, maybe with a count:

Si = sobol.analyze(problem, Y)

>> Error: Y contains NaN/inf values (2 / 5000 samples)

  2. For the Delta-MIM method, you can just remove the missing values and compute the Si values as usual. But for all other methods, the order of the samples matters, so you need to either:

a) Find the bad input parameters and modify the sampling ranges to avoid the failures. This may be easier said than done, assuming the parameter combinations are not linearly separable. But we could think about adding tools to help find regions of the parameter space that cause problems. In a previous thread (#262) @TobiasKAndersen used CART from scikit-learn to do this.

b) Allow the option to replace the missing values with something else. The best suggestion we've had, credit to @dmey, would be to build a response surface model of the input->output relationship using the non-missing values, and use it to fill the missing ones. This is more complicated than just replacing with zero or the mean value, but it's less likely to introduce biases.

For example, after getting the error above, you could switch to:

Si = sobol.analyze(problem, Y, fill_na = True) # (false by default)

I'm not sure what the best option for interpolation would be. Scipy is a little clunky for larger parameter dimensions; maybe scikit-learn? I don't know of any papers that have tested these methods, so there will need to be some experimentation first (see the sketch after this list).

  3. Finally, all of this should have tests attached. For (1) it would be easy to check that the errors are raised. For (2b) we could take the Ishigami model, add some NaNs to Y, and see how close to the true Si values it can get.
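To make (2b) concrete, here is a minimal sketch of what such a fill step might look like outside of SALib, using a scikit-learn RandomForestRegressor as the response surface (the `fill_nan` helper is hypothetical, not an existing SALib function, and the choice of regressor is just one option):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fill_nan(X, Y):
    """Hypothetical helper: fit a response surface on the valid samples
    and use it to predict replacements for the NaN/inf outputs."""
    bad = ~np.isfinite(Y)
    if not bad.any():
        return Y
    surrogate = RandomForestRegressor(n_estimators=200)
    surrogate.fit(X[~bad], Y[~bad])
    Y_filled = Y.copy()
    Y_filled[bad] = surrogate.predict(X[bad])
    return Y_filled

# usage (X is the SALib sample matrix, Y the model output vector):
# Y = fill_nan(X, Y)
# Si = sobol.analyze(problem, Y)
```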
@jdherman jdherman added this to Features in SALib Development Roadmap Nov 7, 2019
@jdherman jdherman added the priority An issue on which several users have requested action label Nov 7, 2019
@jdherman jdherman moved this from Features to Docs in SALib Development Roadmap Nov 11, 2019
@lbteixeira
Contributor

Hello, everyone. Is there anyone working on this topic? I have an application that needs this feature to deal with NaN values, so I would like (need) to implement it.

I like the idea of using a response surface to replace NaNs too; I'll take a look at how this can be done.

What I did in the past when using the Morris method was to remove the entire trajectory once one of its points failed to produce a valid output. This can be a simple solution in this specific case.
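For reference, a minimal sketch of that trajectory-removal step, assuming the standard Morris layout without groups (each trajectory is D + 1 consecutive rows of the sample matrix); the helper name is hypothetical:

```python
import numpy as np

def drop_failed_trajectories(X, Y, num_vars):
    """Hypothetical helper: drop every Morris trajectory that contains
    at least one non-finite model output."""
    rows_per_traj = num_vars + 1          # each trajectory has D + 1 points (no groups)
    n_traj = Y.shape[0] // rows_per_traj
    keep = []
    for t in range(n_traj):
        rows = list(range(t * rows_per_traj, (t + 1) * rows_per_traj))
        if np.isfinite(Y[rows]).all():    # keep only fully valid trajectories
            keep.extend(rows)
    keep = np.asarray(keep, dtype=int)
    return X[keep], Y[keep]
```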

@DmitriyValetov

Good day. When I encountered NaN values while using Sobol analyze, I just removed the broken batches from the sample Y and continued the analysis. So, if your model has some probability of failing, one option is to generate more samples and remove the broken batches from the resulting Y.

@stefanocampanella

> For the Delta-MIM method, you can just remove the missing values and compute the Si values as usual.

Are there methods in SALib which allow managing NaNs in this simple way (i.e. just removing values) and still performing a global sensitivity analysis? As far as I understand, Delta-MIM is for first-order effects. Thanks in advance!

@TobiasKAndersen
Contributor

Removing NaN results from your sample set before analyzing with Delta-MIM should not cause any problems when computing both the delta and S1 values, so it is not only for S1. I have applied this approach in several analyses now, and it works great.
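For reference, a minimal sketch of that workflow, assuming X, Y, and problem are the usual SALib sample matrix, model output vector, and problem definition:

```python
import numpy as np
from SALib.analyze import delta

# keep only the samples whose model output is finite
ok = np.isfinite(Y)
Si = delta.analyze(problem, X[ok], Y[ok])
```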

@ConnectedSystems
Member

Hi @stefanocampanella

To answer your question, SALib does not provide any way to remove NaN values. At this point in time at least you would have to provide a "cleaned" dataset.

In case it is helpful, see this SO post which shows how you can drop rows with NaN values from a numpy array:

https://stackoverflow.com/questions/11453141/how-to-remove-all-rows-in-a-numpy-ndarray-that-contain-non-numeric-values
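In short, the approach from that post looks something like this, assuming `results` is a 2-D array with one row per sample:

```python
import numpy as np

# keep only the rows where every entry is finite (drops both NaN and inf)
cleaned = results[np.isfinite(results).all(axis=1)]
```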

@stefanocampanella

Hi! Sorry for the dumb question, I hadn't read the references in the documentation carefully enough.

Nonetheless, I was wondering if the same considerations also hold for RBD-FAST and DGSM. Furthermore, for these three methods, what are the assumptions on the samples? What if I have samples which are not drawn from the default sampler (i.e. LHS for Delta-MIM and RBD-FAST)? I don't know if this issue is the right place to ask, however; sorry if not.

@ConnectedSystems
Member

Not a dumb question at all.

DMIM and RBD-FAST are sampling independent methods so hypothetically you can use any sampling approach.

DGSM expects results that map to samples generated by the finite_diff sampler.

The issue with dropping results with NaNs is that you lose (possibly important) information regarding the input-output relationship. NaN results may only pertain to one quantity of interest, for example. Another consideration is that NaNs could be taken as a sign that there are issues with the model implementation (e.g. bugs or other problems). Where possible, I'd recommend investigating the cause of the NaN values and addressing that, rather than outright removal.

@TobiasKAndersen
Contributor

Just an idea: instead of replacing NaN values with some interpolation for the variance-based method (Sobol), could another solution be to implement the given-data approach as laid out in, among others:

Plischke, E., Borgonovo, E. and Smith, C.: 2013, Global sensitivity measures from given data.
Borgonovo, E., X. Lu, E. Plischke, O. Rakovec, and M. C. Hill (2017), Making the most out of a hydrological model data set: Sensitivity analyses to open the model black-box.

As I understand it, the given-data approach does not rely on specific sampling strategies, which would decrease the computational burden when you want to apply several SA methods. Also, it would be sampling-independent, so you could simply remove NaN results.

@willu47
Member

willu47 commented Apr 29, 2021

A brief discussion and references on model failure in GSA can be found in Razavi et al., 2021, an excerpt of which is reproduced here in accordance with the CC-BY 4.0 licence:

> Lastly, an issue hindering the application of SA to large, complex models is that some models may fail to run properly (‘crash’) at particular points in the factor space and not produce a response. Simulation failures mainly occur due to non-robust numerical implementations, the violation of numerical stability conditions, or errors in programming. SA algorithms are typically ill-equipped to deal with such failures, as they require running models under many configurations of factors. In addition to improving properties of the original models (e.g., Kavetski and Clark 2010), more research is needed to equip SA algorithms to handle model failures, which is becoming a more pressing issue as the complexity of mathematical models grows. One of the very first studies addressing this issue in the context of SA is Sheikholeslami et al., 2019b, Sheikholeslami et al., 2019a, where a surrogate modeling strategy is used to fill in model output values when the original model fails. To handle this issue, strategies can also be adopted from other types of analyses. For example, Bachoc et al. (2016) used a design of experiments strategy to detect computation failures and code instabilities, and Bachoc et al. (2020b) developed a method to classify model parameters to computation-failure or -success groups during optimization.

@Meiravco

Just to be sure, it is not recommended to remove only the broken points (results and parameters) where the model failed; one should actually remove the entire batch/trajectory, right?
Also, if it is only one or two points in a given trajectory, would it be better to replace it with a model output of similar parameters (or an average of two), or with an average of the output before and after in the same trajectory (if those are not so different)?

@ConnectedSystems
Member

ConnectedSystems commented Jun 27, 2021

@Meiravco

> it is not recommended to remove only the broken points (results and parameters) where the model failed; one should actually remove the entire batch/trajectory, right?

Removing a trajectory may render the entire sample unusable depending on the analysis approach. This may be true both on grounds of theory and in terms of how the algorithms are implemented. For example, removing a batch from a Sobol' sample means the sample no longer follows the Sobol' sequence and so runs into theoretical issues. Removing the batch also changes the underlying data structure such that it may not be "usable" in the sense that it will likely not match what is expected by the implemented analysis functions. This is a general consideration, not specific to SALib.

Considering the cost of model evaluation, I would generally prefer to use "given data" approaches in these circumstances as it requires only the "broken" runs to be removed and no "usable" samples are "lost".

> Also, if it is only one or two points in a given trajectory, would it be better to replace it with a model output of similar parameters (or an average of two), or with an average of the output before and after in the same trajectory (if those are not so different)?

To me, any approach which "fills in" missing data comes with the risk of introducing bias. This is not to say it's a bad idea, just that there are further considerations.

One idea is to check how much the choice of replacement values (e.g., interpolating between neighbouring values or using their mean/median) might affect the sensitivity indices. Depending on the context of your work, if the indices do not change "much", then having the "missing" information likely would not have changed your final outcome.
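A rough sketch of that check for a Morris analysis, assuming X, Y, and problem are the usual SALib objects; the two fill strategies shown here (global mean and median of the valid outputs) are just placeholders for whatever replacement rules you want to compare:

```python
import numpy as np
from SALib.analyze import morris

def fill_with(Y, value):
    """Replace non-finite outputs with a fixed value (illustration only)."""
    Y = Y.copy()
    Y[~np.isfinite(Y)] = value
    return Y

finite = Y[np.isfinite(Y)]
candidates = {"mean": finite.mean(), "median": np.median(finite)}

for name, value in candidates.items():
    Si = morris.analyze(problem, X, fill_with(Y, value))
    # if mu_star barely changes between strategies, the missing values
    # probably would not have changed the conclusions
    print(name, Si["mu_star"])
```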

@Meiravco

OK, so for a Sobol analysis it would be better to remove only the bad points. However, I think for a Morris analysis, removing only the bad points rather than the entire trajectory will change the trajectory order and mess up the analysis structure. So in a Morris analysis, wouldn't it be better to remove the entire trajectory?
I think if it is only one or two points for which I can find similar parameters, I will fill in some averaged values from those similar parameter values, and if I have more than two bad points in a given trajectory, I will remove the entire trajectory. I will also try removing those trajectories completely, and also removing only the bad points, and compare the results of the three scenarios.

@ConnectedSystems
Member

> OK, so for a Sobol analysis it would be better to remove only the bad points. However, I think for a Morris analysis, removing only the bad points rather than the entire trajectory will change the trajectory order and mess up the analysis structure. So in a Morris analysis, wouldn't it be better to remove the entire trajectory?

I think for Sobol'-based analyses any removal will effectively change the structure, so it won't be a Sobol' sample any more. In this case, if Sobol' analysis is the only option, then I think it would be more defensible to "fill" those missing values via interpolation rather than removing them.

For Morris, I'm not sure what implications removing a trajectory has in terms of change in data structure. It would also mean you are removing exploration of that trajectory, which might severely limit the analysis. Removing an entire trajectory for the sake of a single "bad" point seems wasteful as well. I think I would prefer to interpolate between valid values if given-data approaches were not an option.

A third option that I neglected to mention is to emulate the original model and use that for sensitivity analysis. Some approaches produce sensitivity estimates as a by-product. Depending on the model (and emulation approach) this process can become quite involved, however.

> I will also try removing those trajectories completely, and also removing only the bad points, and compare the results of the three scenarios.

This seems sensible to me. At the very least you'll get an understanding of how important those missing values are.

@ConnectedSystems ConnectedSystems moved this from v1.4 to v1.5 onward in SALib Development Roadmap Jun 27, 2021
@Meiravco

So an update on the results of the three methods:

  1. When I interpolated the change in output according to similar input parameter transformations, I got a big bias in the Morris values (sigma and mu star) for two input parameters.
  2. When I removed full trajectories, or left a value of 0 for the non-converged samples, I received very similar Morris results that also seemed to make sense.
  3. When I removed only the bad points, I also received results similar to those in point 2, but the values were more compacted together, though the order of parameter importance stayed the same.

So it seems to me removing full trajectories might be the safest option with a Morris analysis: though we lose some samples, we can be sure of the results that are left.

Another question, maybe for a different discussion, but I will try: if many (60%) of my output values (concentration) are 0, does that present a problem in checking the model sensitivity (many of the samples are non-sensitive)?

@Meiravco

I would like to suggest another approach and recommend a study on the subject.
I also tried output data interpolation by fitting the existing data with regression algorithms like Radial Basis Functions (RBF) and Nearest Neighbors. I used both, as well as the methods mentioned above, and found RBF to be the best solution for interpolating the missing data. The results using the RBF-filled data had the lowest confidence intervals.
See also this paper on the subject:
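For what it's worth, a minimal sketch of that RBF fill using scipy's RBFInterpolator (scipy >= 1.7), fitted on the finite samples; variable names are illustrative, with X the sample matrix and Y the model output:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

bad = ~np.isfinite(Y)
# fit the RBF on the samples that produced valid output...
rbf = RBFInterpolator(X[~bad], Y[~bad])
# ...then predict replacement values for the failed runs
Y_filled = Y.copy()
Y_filled[bad] = rbf(X[bad])
```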
