Multiple imputation via bootstrapping rather than predictive mean matching #29

Closed
bgall opened this issue Apr 30, 2021 · 1 comment
bgall commented Apr 30, 2021

missRanger contains the pmm.k argument to allow users to add more variability to their imputed values and to draw imputed values from the observed values. However, variability is already built into the random forest model by (a) growing each tree on a bootstrap sample of the rows (the rest forming the "out of bag" data), (b) drawing candidate features at random at each split, and (c) aggregating many trees. Of course, you might still underestimate variability and choose to derive multiple imputed datasets. However, the documentation advises using predictive mean matching to add variability across the imputed datasets.
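For reference, the two modes look roughly like this (a sketch, with a hypothetical data frame `df` containing missing values):

```r
library(missRanger)

# pmm.k = 0: impute with the raw random forest predictions.
imp_plain <- missRanger(df, pmm.k = 0, num.trees = 100, seed = 1)

# pmm.k = 3: replace each prediction by a value drawn from the
# 3 observed values whose own predictions are closest to it.
imp_pmm <- missRanger(df, pmm.k = 3, num.trees = 100, seed = 1)
```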

It would be worth acknowledging in the documentation that this, too, will understate variability, because it treats only the donor pool as random when in fact all of the data are random. An approach that (a) eliminates the need for predictive mean matching (and for tuning the number of nearest neighbors), (b) is more theoretically motivated, and (c) is more robust to "false convergence" because it adds further variability to the initialization values of the chained equations, is to bootstrap the entire dataset (sample rows with replacement) once per desired imputation and then run missRanger() on each bootstrapped dataset. As far as I can tell, this is no more computationally complex and is strictly better.

The one thing to be aware of is that the random number seed needs to be set before bootstrapping the data. You can then reuse the same seed for each full imputation via missRanger(): identical seeds will still produce different imputations at that step because the inputs already differ.
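Roughly, in code (a sketch, with `df` standing in for the original dataset and m = 20 imputations chosen arbitrarily):

```r
library(missRanger)

set.seed(2021)  # declare the seed once, before any bootstrapping

m <- 20  # number of multiply imputed datasets

# Draw all bootstrap samples up front so that any seed handling
# inside missRanger() cannot interfere with the resampling.
boots <- lapply(seq_len(m), function(i) {
  df[sample(nrow(df), replace = TRUE), ]
})

# Impute each bootstrapped dataset. Reusing the same seed is fine:
# identical seeds still yield different imputations because the
# bootstrapped inputs already differ.
imputations <- lapply(boots, function(b) {
  missRanger(b, pmm.k = 0, num.trees = 100, seed = 1, verbose = 0)
})
```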

mayer79 (Owner) commented May 1, 2021

Good point: I will add to the documentation that the PMM approach in missRanger() will typically still underestimate the variance. However, with pmm.k = 0, the imputed values are estimated conditional means (namely, predictions from a model), so the variability would be grossly underestimated. The randomness in a random forest prediction cannot compensate for the missing sampling error. It is a bit like the difference between the standard deviation of a variable and the standard deviation of its sample mean. Once fitted, a random forest's predictions contain no variability anymore; PMM adds back at least some of it.
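To illustrate with ranger (the engine behind missRanger) on the built-in iris data, a small sketch:

```r
library(ranger)

# Fit a forest once; its predictions are then fixed numbers.
fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100, seed = 1)
p1 <- predict(fit, data = iris)$predictions
p2 <- predict(fit, data = iris)$predictions
identical(p1, p2)      # TRUE: once fitted, no variability remains

# Predictions are estimated conditional means, so they are less
# spread out than the observed values they would replace.
sd(iris$Sepal.Length)  # spread of the variable
sd(p1)                 # noticeably smaller
```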

One of the reasons for doing PMM in its current form is to create realistic values (e.g., for a variable with values in {1, 2}), while other approaches would return values like 1.2221.
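A quick sketch of that behavior on simulated data (variable names are illustrative):

```r
library(missRanger)

set.seed(1)
# y takes only the values 1 and 2; 20 entries are set to missing.
df <- data.frame(y = rep(c(1, 2), 50), x = rnorm(100))
df$y[sample(100, 20)] <- NA

imp_pmm <- missRanger(df, pmm.k = 3, seed = 1, verbose = 0)
all(imp_pmm$y %in% c(1, 2))   # TRUE: donors are observed values

imp_raw <- missRanger(df, pmm.k = 0, seed = 1, verbose = 0)
head(imp_raw$y[is.na(df$y)])  # typically fractional predictions
```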

I will keep your approach to multiple imputation in mind; it seems very good! If I find time, I might add it to the "multiple imputation" vignette.

mayer79 self-assigned this May 1, 2021
mayer79 closed this as not planned Mar 22, 2023