Multiple imputation via bootstrapping rather than predictive mean matching #29
I will add to the documentation that one of the reasons for doing PMM in its current form is to create realistic values (e.g. for a variable with values in {1, 2}), while other approaches would return values like 1.2221. I will keep your approach for multiple imputation in mind; it seems like a very good approach! If I find time, I might add it to the "multiple imputation" vignette.
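To illustrate the point about realistic values, here is a minimal sketch (the toy data frame and column names are made up for illustration):

```r
library(missRanger)

# Hypothetical toy data: a coded variable taking only the values 1 and 2
set.seed(1)
d <- data.frame(
  x = c(1, 2, 1, 2, 1, 2, NA, NA),
  y = rnorm(8)
)

# With pmm.k > 0, each imputed x is copied from an observed donor value,
# so imputations stay in {1, 2} instead of fractional predictions like 1.2221
imp <- missRanger(d, pmm.k = 3, verbose = 0)
```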
`missRanger` contains the `pmm.k` argument to allow users to add more variability to their imputed values and to obtain imputations drawn from observed values. However, variability is already built into the random forest model by (a) generating bootstrapped "out of bag" data, (b) drawing features at random, and (c) averaging over many trees. Of course, you might still underestimate variability and choose to derive multiple imputed datasets. The documentation advises using predictive mean matching to add variability across the imputed datasets. It would be worth acknowledging in the documentation that this, too, will understate variability, because it treats only the donor pool as random when really all of your data are random.

An approach that (a) eliminates the need for predictive mean matching (and for tuning the number of nearest neighbors selected), (b) is more theoretically motivated, and (c) is more robust to "false convergence" by adding further variability to the initialization values in the chained equations is to bootstrap your entire dataset (sample rows with replacement) once per desired imputation and then run `missRanger()` on each bootstrapped dataset. As far as I can tell, this isn't any more computationally complex and is strictly better.

The one thing to be aware of is that your random number seed needs to be declared prior to bootstrapping the data. You can then reuse the same seed for each full run of `missRanger()`, since even an identical seed will produce different imputed values at that step because the bootstrapped data themselves already differ.