More granular control over which cells get imputed #40

pjlambert · 2022-09-13T15:57:50Z

I would love to see a feature whereby I can feed a logical matrix of the same dimensions as the underlying data.frame into the function, which controls which cells in the data.frame get imputed, and which do not.

I currently have a data.frame which has two types of NAs: (i) data which I need to impute, and (ii) data which I know should never exist.

This situation arises when trying to impute unbalanced panel data (e.g. annual income of a population of individuals). Since I reshape this data to be "wide" (one row per person), I end up with a number of columns (e.g. income_2010, income_2011, etc.). This is essential to capture time dynamics (i.e. my income this year and next year are strongly correlated).

So for a person who died in 2005, I do not wish to impute income_2006, income_2007, etc. But for someone whose income is missing during their lifetime, I would like to impute it.

All the best - and thanks for a great package!

mayer79 · 2022-09-14T06:57:14Z

Thanks for pinging me. I think your situation occurs quite frequently and you are definitely using the right data shape (wide, not long) to do the imputation.

missRanger() needs to fill all missing values during the process, because the backend ranger() cannot work with missing values. Thus, I see two solutions that do not touch missRanger()'s internal logic:

1. You use that logical matrix after the imputation to set the corresponding cells to missing again. (If a missing income 2006 is filled during the process, and that value is then used to impute 2007 (and other variables), it should not hurt the statistical associations.)
2. You split the data into two: one with persons having all missing incomes and one with the rest. Then, both data sets are imputed separately. I don't think it is a very good approach compared to the first one.
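
The first approach could be sketched roughly as follows. This is only an illustration on toy data: the data frame `df`, the `last_year` column (last year a person is alive) and the mask `never_exists` are hypothetical names, and the tuning values passed to missRanger() are arbitrary.

```r
library(missRanger)

# Toy wide panel (hypothetical data): one row per person.
set.seed(1)
n <- 50
years <- 2010:2012
income_cols <- paste0("income_", years)
df <- data.frame(
  last_year   = sample(years, n, replace = TRUE),  # last year the person is alive
  income_2010 = rnorm(n, 100, 10),
  income_2011 = rnorm(n, 105, 10),
  income_2012 = rnorm(n, 110, 10)
)

# Logical matrix with the same dimensions as df: TRUE = cell must stay NA.
never_exists <- matrix(
  FALSE, nrow(df), ncol(df), dimnames = list(NULL, names(df))
)
for (j in seq_along(years)) {
  never_exists[, income_cols[j]] <- years[j] > df$last_year
}

# Type (ii) NAs: incomes after the last year alive should never exist.
df[never_exists] <- NA

# Type (i) NAs: incomes missing during a lifetime, to be imputed.
df$income_2010[1:5] <- NA

# missRanger() fills every NA, including the structural ones ...
imputed <- missRanger(df, pmm.k = 3, num.trees = 50, verbose = 0, seed = 1)

# ... so the mask is applied afterwards to blank those cells again.
imputed[never_exists] <- NA
```

The mask has to be built before imputation (or from information such as a death year), because after missRanger() has run the two kinds of NA can no longer be told apart.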

pjlambert · 2022-09-15T19:17:39Z

Thanks! I agree approach 1 is optimal.

Just to add a further tip for those who have similar use-cases:

I have adopted approach (1) from above
I have also added a dummy variable for "recently born", which takes a value of 1 in the first 3 years of life, and 0 otherwise. This then also gets widened.
Likewise I add a "soon to die" variable which equals 1 for the last three years a person is alive, and 0 otherwise. This at least helps to add some life-cycle dynamics to the imputations, in a way that is not too high dimensional.
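
For readers who want to try the two indicator variables, a minimal base R sketch might look like this. The long-format data frame `long` and its columns (`id`, `year`, `birth_year`, `death_year`, `income`) are hypothetical; only the three-year windows come from the description above.

```r
# Hypothetical long panel: one row per person-year.
long <- data.frame(
  id         = rep(1:2, each = 5),
  year       = rep(2001:2005, times = 2),
  birth_year = rep(c(2001, 1950), each = 5),
  death_year = rep(c(NA, 2005), each = 5),  # NA = no known death year
  income     = c(10, NA, 12, 13, 14, 50, 52, NA, 55, 40)
)

# 1 in the first three years of life, 0 otherwise.
long$recently_born <- as.integer(
  long$year >= long$birth_year & long$year - long$birth_year < 3
)

# 1 in the last three years a person is alive, 0 otherwise
# (also 0 when no death year is known).
long$soon_to_die <- as.integer(
  !is.na(long$death_year) &
    long$year <= long$death_year &
    long$death_year - long$year < 3
)

# Widen: one row per person, one column per variable and year,
# e.g. income_2001, recently_born_2001, soon_to_die_2001, ...
wide <- reshape(
  long[, c("id", "year", "income", "recently_born", "soon_to_die")],
  idvar = "id", timevar = "year", direction = "wide", sep = "_"
)
```

In a truly unbalanced panel, person-years that are absent in long format become NA in the widened indicator columns as well; whether to set those to 0 before imputing is a modelling choice this sketch leaves open.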

If anyone else has faced the issue of imputing missing data for unbalanced panel data, would love to hear more about the strategies used.

pjlambert · 2022-09-15T19:19:17Z

One quick follow-up issue is that when one "widens" the data, the donor pool for pmm is reduced a lot. This is not a big issue for my case - but something to consider.

mayer79 closed this as completed Mar 24, 2023
