Handling NaN/inf model output #273
Hello, everyone. Is there anyone working on this topic? I have an application that needs this feature to deal with NaN values, so I would like (need) to implement it. I like the idea of using a response surface to replace NaNs too; I'll take a look at how this can be done. What I did in the past when using the Morris method was to remove the entire trajectory once one of its points failed to produce a valid output. This can be a simple solution in this specific case.
Good day. When I encountered NaNs while using the Sobol' analysis, I just removed the broken batches from the sample Y and continued the analysis. So, if your model has some probability of failing, one option is to generate more samples and remove the broken batches from the resulting Y.
Are there methods in SALib which allow managing NaNs in this simple way (i.e. just removing values) while still performing a global sensitivity analysis? As far as I understand, Delta-MIM is only for first-order effects. Thanks in advance!
Removing NaN results from your sample set before analyzing with Delta-MIM should not cause any problems when computing both delta and S1 values, so it is not limited to S1. I have applied this approach in several analyses now, and it works great.
To answer your question, SALib does not provide any way to remove NaN values; at this point in time, at least, you would have to provide a "cleaned" dataset yourself. In case it is helpful, see this SO post, which shows how you can drop rows with NaN values from a NumPy array:
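For reference, a minimal NumPy sketch of that row-dropping approach (the X and Y arrays here are made up for illustration):

```python
import numpy as np

# Hypothetical data: X holds sampled parameters (one row per model run)
# and Y holds the corresponding outputs, some of which are NaN.
X = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6],
              [0.7, 0.8]])
Y = np.array([1.0, np.nan, 2.0, np.nan])

# Keep only rows whose output is finite (drops NaN and inf alike)
mask = np.isfinite(Y)
X_clean = X[mask]
Y_clean = Y[mask]

print(X_clean.shape)  # (2, 2)
print(Y_clean)        # [1. 2.]
```

Note that X and Y must be filtered with the same mask so the parameters and outputs stay aligned.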
Hi! Sorry for the dumb question, I hadn't read the references in the documentation carefully enough. Nonetheless, I was wondering whether the same considerations also hold for RBD-FAST and DGSM. Furthermore, for these three methods, what are the assumptions on the samples? What if I have samples which are not drawn from the default sampler (i.e. LHS for Delta-MIM and RBD-FAST)? I don't know if this issue is the right place to ask, however; sorry if not.
Not a dumb question at all. DMIM and RBD-FAST are sampling-independent methods, so hypothetically you can use any sampling approach. DGSM, in contrast, expects results that map to samples generated by its associated sampler. The issue with dropping results with NaNs is that you lose (possibly important) information regarding the input-output relationship. NaN results may only pertain to one quantity of interest, for example. Another consideration is that NaNs could be taken as a sign that there are issues with the model implementation (e.g. bugs or other). Where possible, I'd recommend investigating the cause of NaN values and addressing that, rather than outright removal.
Just an idea: instead of replacing NaN values with some interpolation for the variance-based method (Sobol'), could another solution be to implement the given-data approach as laid out in, among others: Plischke, E., Borgonovo, E. and Smith, C. (2013), Global sensitivity measures from given data. As I understand it, in the given-data approach you don't rely on specific sampling strategies, which would decrease the computational burden when you want to perform several SA methods. It would also be sampling independent, so you could simply remove NaN results.
A brief discussion and references on model failure in GSA can be found in Razavi et al., 2021, an excerpt of which is reproduced here in accordance with the CC-BY 4.0 licence:
Just to be sure: it is not recommended to remove only the broken points (results and parameters) where the model failed; one should actually remove the entire batch/trajectory, right?
Removing a trajectory may render the entire sample unusable depending on the analysis approach. This may be true both on grounds of theory and in terms of how the algorithms are implemented. For example, removing a batch from a Sobol' sample means the sample no longer follows the Sobol' sequence and so runs into theoretical issues. Removing the batch also changes the underlying data structure such that it may not be "usable" in the sense that it will likely not match what is expected by the implemented analysis functions. This is a general consideration, not specific to SALib. Considering the cost of model evaluation, I would generally prefer to use "given data" approaches in these circumstances as it requires only the "broken" runs to be removed and no "usable" samples are "lost".
To me, any approach which "fills in" missing data comes with the risk of introducing bias. This is not to say it's a bad idea, just that there are further considerations. One idea is to check how much the choice of replacement values (e.g., interpolating between, or taking the mean/median of, neighbouring values) might affect the sensitivity indices. Depending on the context of your work, if the indices do not change "much", then having the "missing" information likely would not have changed your final outcome.
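A crude version of that check, with made-up data (in practice you would recompute the sensitivity indices themselves rather than the simple summary statistic used below):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical output vector with roughly 15% of values missing
Y = rng.normal(size=100)
Y[rng.random(100) < 0.15] = np.nan
ok = np.isfinite(Y)

# Candidate replacement values for the missing entries
fills = {"mean": np.nanmean(Y), "median": np.nanmedian(Y)}

# Fill with each candidate and compare a downstream quantity
results = {}
for name, value in fills.items():
    Y_filled = np.where(ok, Y, value)
    results[name] = Y_filled.var()
    print(name, results[name])
```

If the different fills lead to materially different downstream results, the missing values matter and removal or filling deserves more care.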
OK, so for a Sobol' analysis it will be better to remove only the bad points. However, I think that for a Morris analysis, removing only the bad points rather than the entire trajectory will change the ordering of the trajectories and mess up the analysis structure. So in a Morris analysis, wouldn't it be better to remove the entire trajectory?
I think for Sobol'-based analyses any removal will effectively change the structure, so it won't be a Sobol' sample any more. In this case, if Sobol' analysis is the only option, then I think it would be more defensible to "fill" those missing values via interpolation rather than removing them. For Morris, I'm not sure what implications removing a trajectory has in terms of change in data structure. It would also mean you are removing exploration along that trajectory, which might severely limit the analysis. Removing an entire trajectory for the sake of a single "bad" point seems wasteful as well. I think I would prefer to interpolate between valid values if given-data approaches were not an option. A third option that I neglected to mention is to emulate the original model and use the emulator for sensitivity analysis. Some emulation approaches produce sensitivity estimates as a by-product. Depending on the model (and emulation approach), this process can become quite involved, however.
This seems sensible to me. At the very least you'll get an understanding of how important those missing values are.
So an update on the results of the three methods:
So it seems to me that removing full trajectories might be the safest option with a Morris analysis: though we lose some samples, we can be sure of the results that are left. Another question, maybe for a different discussion, but I will try: if many (60%) of my output values (concentration) are 0, does it present a problem in checking the model sensitivity (many of the samples are non-sensitive)?
This has been maybe the most common question / annoyance for users: what to do about missing data in the output vector Y? (See #134, #255, #262, #237, #206, and PR #235.)
We need a consistent way to deal with this across all methods. I'd suggest that analyze() methods should raise an error if NaNs are detected, maybe with a count. Then, either:
a) Find the bad input parameters and modify the sampling ranges to avoid the failures. This may be easier said than done, assuming the parameter combinations are not linearly separable. But we could think about adding tools to help find regions of the parameter space that cause problems. In a previous thread (#262) @TobiasKAndersen used CART from scikit-learn to do this.
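To sketch the CART idea (the failure rule and variable names here are invented for illustration, not taken from #262): fit a shallow decision tree that separates failed runs from successful ones, then read off the splits describing the failure region.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical setup: the model "fails" (would return NaN) whenever
# the first parameter exceeds 0.7 -- an unknown region we want to recover.
X = rng.uniform(0.0, 1.0, size=(500, 2))
failed = X[:, 0] > 0.7  # stand-in for np.isnan(Y)

# A shallow CART separating failing from succeeding runs
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, failed)

# Human-readable split rules describing the failure region
print(export_text(tree, feature_names=["x0", "x1"]))
```

The printed rules should recover a threshold near x0 = 0.7, which could then be used to tighten the sampling ranges.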
b) Allow the option to replace the missing values with something else. The best suggestion we've had, credit to @dmey, would be to build a response surface model of the input->output relationship using the non-missing values, and use it to fill the missing ones. This is more complicated than just replacing with zero or the mean value, but it's less likely to introduce biases.
For example, after getting the error above, you could then switch to option (a) or (b).
Not sure what would be the best option for interpolation. SciPy is a little clunky for larger parameter dimensions. Maybe scikit-learn? I don't know of any papers that have tested these methods, so there will need to be some experimentation first.
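A minimal sketch of the response-surface fill with scikit-learn (the toy model and the random-forest choice are assumptions for illustration, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical samples: 200 runs over 3 parameters, ~10% failed (NaN)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
Y = X[:, 0] ** 2 + 0.5 * X[:, 1]   # stand-in for real model output
Y[rng.random(200) < 0.1] = np.nan  # simulate failed runs

ok = np.isfinite(Y)

# Fit a response surface on the successful runs only...
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X[ok], Y[ok])

# ...then fill the missing outputs before calling analyze()
Y_filled = Y.copy()
Y_filled[~ok] = surrogate.predict(X[~ok])
```

Any regressor with a fit/predict interface would slot in here, so it is easy to compare several fills as suggested above.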
A simple test would be to take a function with known sensitivities, remove some values from Y, and see how close to the true Si values it can get.