How would one deal with categorical data? #195

Open
bretttully opened this issue Apr 12, 2018 · 5 comments

@bretttully

Hi guys,
Thanks for a great library. I am building a model that has both categorical and continuous parameters. I have read a few papers suggesting that you can create a categorical parameter by binning a continuous value (e.g. sample with bounds [-1, 1] and define the true/false category as val >= 0). However, when I do this with Saltelli sampling, I end up with duplicate parameter sets in the sample. Do you have any advice?
Thanks!

@jdherman
Member

Hi Brett,
Thanks for using the library. The approach you described is what I've suggested to people in the past (https://waterprogramming.wordpress.com/2014/02/11/extensions-of-salib-for-more-complex-sensitivity-analyses/).

But I never thought about the fact that it would result in duplicate samples! If the model is very slow, this is clearly not a good use of computing time. But if the model runs reasonably quickly, is there a downside? I'm not sure.

From the perspective of the SA method, a small change in the continuous variable that it samples results in no change in the output, because the inputs are rounded. This is the same situation as a model that has a discontinuity of its own, even with continuous inputs. So I'd have to think it's still a valid application of these SA methods.

We don't have any sampling methods written for categorical input variables -- mostly because I don't know of any published SA methods that work with them!
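
To make the duplication concrete, here is a minimal sketch of why binning collapses rows. It does not use SALib itself: it imitates the Saltelli cross-sampling layout (base rows A and B, plus rows that copy A but take one column from B) with plain pseudo-random draws, and the three-parameter problem and the names `to_bool`, `decode`, and `draw_row` are assumptions made up for illustration.

```python
import random

def to_bool(val):
    """Bin a continuous sample from [-1, 1] into a true/false category."""
    return val >= 0

def decode(row):
    """Hypothetical problem: column 0 is boolean, columns 1 and 2 stay continuous."""
    return (to_bool(row[0]), row[1], row[2])

rng = random.Random(42)
D = 3  # number of parameters (assumed for this sketch)
N = 8  # number of base samples (assumed for this sketch)

def draw_row():
    return [rng.uniform(-1, 1) for _ in range(D)]

rows = []
collapsed = 0  # cross-sampled rows that decode to the same inputs as their A row
for _ in range(N):
    a, b = draw_row(), draw_row()
    rows.append(a)
    for i in range(D):
        ab = list(a)
        ab[i] = b[i]  # differs from A only in column i
        rows.append(ab)
    rows.append(b)
    # If A and B fall in the same bin for column 0, the row that swaps
    # column 0 becomes identical to A after decoding.
    if to_bool(a[0]) == to_bool(b[0]):
        collapsed += 1

decoded = [decode(r) for r in rows]
n_total = len(decoded)
n_unique = len(set(decoded))
print(f"{n_total} sampled rows, {n_unique} distinct after binning, {collapsed} collapsed")
```

Each cross-sampled row differs from its base row in exactly one column, so whenever that column is categorical and both values land in the same bin, the whole row repeats. With a boolean input this happens for roughly half of the base samples.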

@bretttully
Author

Thanks @jdherman -- in case anyone else comes here too, this is what StackOverflow has to say: https://stackoverflow.com/questions/36606101/using-python-salib-saltelli-sample-method-with-boolean-or-discrete-input-paramet

@bretttully
Author

While you are here, I was wondering how you would approach selecting a model in the following scenario:

  • 9 parameters: 5 are boolean, 1 is choice of three, and 3 are float ranges
  • Each run of the simulation is about 75-90 hours of compute, so we can run this at most 500 times
  • There isn't a physical model that we can build to understand how the parameters should behave
  • We are trying to find the parameter set that minimises a cost function

Having watched @willu47's talk (https://www.youtube.com/watch?v=gkR_lz5OptU), it seems like Morris is more appropriate than Sobol...?

@jdherman
Member

That's a tough situation. It's true that Morris tends to converge faster than Sobol. But 500 model runs with 9 uncertain parameters is still not very much. You might consider instead using the 500 evaluations to build a surrogate model, then running SA on that.
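
As a rough budget check for the scenario above, the sketch below works out how far 500 runs go. It assumes the usual sample sizes for these designs, N(2D + 2) model runs for Saltelli sampling with second-order indices, N(D + 2) without them, and r(D + 1) for Morris with r trajectories; the function names are hypothetical helpers, not SALib functions.

```python
def saltelli_runs(n_base, n_params, second_order=True):
    """Model runs needed by a Saltelli design with n_base base samples."""
    factor = 2 * n_params + 2 if second_order else n_params + 2
    return n_base * factor

def morris_runs(n_trajectories, n_params):
    """Model runs needed by a Morris design with n_trajectories trajectories."""
    return n_trajectories * (n_params + 1)

BUDGET = 500  # maximum affordable model runs
D = 9         # number of parameters

max_saltelli_n = BUDGET // (2 * D + 2)     # base samples, with second-order indices
max_saltelli_n_first = BUDGET // (D + 2)   # base samples, first/total-order only
max_morris_r = BUDGET // (D + 1)           # Morris trajectories

# Total compute at 75 to 90 hours per run
compute_hours = (BUDGET * 75, BUDGET * 90)

print("Saltelli base samples (second order):", max_saltelli_n)
print("Saltelli base samples (first/total only):", max_saltelli_n_first)
print("Morris trajectories:", max_morris_r)
print("Compute hours for the full budget:", compute_hours)
```

Under these assumptions the budget allows about 25 Saltelli base samples (45 without second-order indices) or 50 Morris trajectories, which illustrates why 500 runs is thin for a Sobol analysis of 9 parameters.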

@willu47
Member

willu47 commented Apr 15, 2018

Given the expense of your model, it may be worth starting with a fractional factorial approach. This requires a very small sample for the 9 parameters and will at least tell you something.

Another observation is that while your sample may contain duplicate parameter sets, you don't need to run the model again for them. Just substitute in the output of the model run from the first occurrence of that combination of variable values.
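
Reusing the output for duplicated parameter sets can be sketched with a dictionary keyed on the decoded inputs. This is a minimal illustration, not part of SALib: `run_with_cache` and `toy_model` are hypothetical names, and the toy model stands in for the expensive simulation.

```python
def run_with_cache(param_sets, model):
    """Evaluate model once per distinct parameter set and reuse the result.

    Returns one output per input row (in order) and the number of real runs.
    """
    cache = {}
    outputs = []
    for params in param_sets:
        key = tuple(params)  # parameter sets must be hashable to act as keys
        if key not in cache:
            cache[key] = model(params)  # the only place the model is called
        outputs.append(cache[key])
    return outputs, len(cache)

def toy_model(params):
    """Cheap stand-in for the real simulation (assumed for this sketch)."""
    flag, choice, x = params
    return (1.0 if flag else 0.0) + {"a": 0.0, "b": 0.5, "c": 1.0}[choice] + x

# Decoded samples: (boolean, choice of three, float); two rows are repeated
samples = [
    (True, "a", 0.25),
    (False, "b", 0.5),
    (True, "a", 0.25),
    (False, "b", 0.5),
    (True, "c", 0.75),
]

outputs, n_runs = run_with_cache(samples, toy_model)
print(f"{len(outputs)} outputs from {n_runs} model runs")
```

The output list keeps one entry per sampled row, in the original order, so it can be passed to the analysis step unchanged while only the distinct rows cost a model run.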
