Multiple imputation via bootstrapping rather than predictive mean matching #29

Closed
bgall opened this issue Apr 30, 2021 · 1 comment
bgall commented Apr 30, 2021

missRanger contains the pmm.k argument to allow users to add more variability to their imputed values and to draw imputed values from the observed values. However, variability is already built into the random forest model by (a) growing each tree on a bootstrap sample of the rows (the rest forming the "out of bag" data), (b) drawing candidate features at random at each split, and (c) aggregating many trees. Of course, you might still underestimate variability and choose to derive multiple imputed datasets. However, the documentation advises using predictive mean matching to add variability across the imputed datasets.
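For reference, the two modes look roughly like this (a sketch, with a hypothetical data frame `df` containing missing values):

```r
library(missRanger)

# pmm.k = 0: impute with the raw random forest predictions.
imp_plain <- missRanger(df, pmm.k = 0, num.trees = 100, seed = 1)

# pmm.k = 3: replace each prediction by a value drawn from the
# 3 observed values whose own predictions are closest to it.
imp_pmm <- missRanger(df, pmm.k = 3, num.trees = 100, seed = 1)
```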

It would be worth acknowledging in the documentation that this, too, will understate variability, because it treats only the donor pool as random when in fact all of the data are random. An approach that (a) eliminates the need for predictive mean matching (and for tuning the number of nearest neighbors), (b) is more theoretically motivated, and (c) is more robust to "false convergence" because it adds further variability to the initialization values of the chained equations, is to bootstrap the entire dataset (sample rows with replacement) once per desired imputation and then run missRanger() on each bootstrapped dataset. As far as I can tell, this is no more computationally complex and is strictly better.

The one thing to be aware of is that the random number seed needs to be set before bootstrapping the data. You can then reuse the same seed for each full imputation via missRanger(): identical seeds will still produce different imputations at that step because the inputs already differ.
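Roughly, in code (a sketch, with `df` standing in for the original dataset and m = 20 imputations chosen arbitrarily):

```r
library(missRanger)

set.seed(2021)  # declare the seed once, before any bootstrapping

m <- 20  # number of multiply imputed datasets

# Draw all bootstrap samples up front so that any seed handling
# inside missRanger() cannot interfere with the resampling.
boots <- lapply(seq_len(m), function(i) {
  df[sample(nrow(df), replace = TRUE), ]
})

# Impute each bootstrapped dataset. Reusing the same seed is fine:
# identical seeds still yield different imputations because the
# bootstrapped inputs already differ.
imputations <- lapply(boots, function(b) {
  missRanger(b, pmm.k = 0, num.trees = 100, seed = 1, verbose = 0)
})
```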

mayer79 (Owner) commented May 1, 2021

Good point: I will add to the documentation that the PMM approach in missRanger() will typically still underestimate the variance. However, with pmm.k = 0, the imputed values are estimated conditional means (namely, predictions from a model), so the variability would be grossly underestimated. The randomness in a random forest prediction cannot compensate for the missing sampling error. It is a bit like the difference between the standard deviation of a variable and the standard deviation of its sample mean. Once fitted, a random forest's predictions contain no variability anymore; PMM adds back at least some of it.
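To illustrate with ranger (the engine behind missRanger) on the built-in iris data, a small sketch:

```r
library(ranger)

# Fit a forest once; its predictions are then fixed numbers.
fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100, seed = 1)
p1 <- predict(fit, data = iris)$predictions
p2 <- predict(fit, data = iris)$predictions
identical(p1, p2)      # TRUE: once fitted, no variability remains

# Predictions are estimated conditional means, so they are less
# spread out than the observed values they would replace.
sd(iris$Sepal.Length)  # spread of the variable
sd(p1)                 # noticeably smaller
```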

One of the reasons for doing PMM in its current form is to create realistic values (e.g., for a variable with values in {1, 2}), while other approaches would return values like 1.2221.
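A quick sketch of that behavior on simulated data (variable names are illustrative):

```r
library(missRanger)

set.seed(1)
# y takes only the values 1 and 2; 20 entries are set to missing.
df <- data.frame(y = rep(c(1, 2), 50), x = rnorm(100))
df$y[sample(100, 20)] <- NA

imp_pmm <- missRanger(df, pmm.k = 3, seed = 1, verbose = 0)
all(imp_pmm$y %in% c(1, 2))   # TRUE: donors are observed values

imp_raw <- missRanger(df, pmm.k = 0, seed = 1, verbose = 0)
head(imp_raw$y[is.na(df$y)])  # typically fractional predictions
```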

I will keep your approach to multiple imputation in mind; it seems very good! If I find time, I might add it to the "multiple imputation" vignette.

mayer79 self-assigned this May 1, 2021
mayer79 closed this as not planned Mar 22, 2023