More granular control over which cells get imputed #40

Closed
pjlambert opened this issue Sep 13, 2022 · 3 comments
Comments

@pjlambert

I would love to see a feature whereby I can feed a logical matrix of the same dimensions as the underlying data.frame into the function, which controls which cells in the data.frame get imputed, and which do not.

I currently have a data.frame which has two types of NAs: (i) data which I need to impute, and (ii) data which I know should never exist.

This situation arises when trying to impute unbalanced panel data (e.g. annual income of a population of individuals). Since I reshape this data to be "wide" (one row per person), I end up with a number of columns (e.g. income_2010, income_2011, ... etc). This is essential to capture time dynamics (i.e. my income this year and next year are strongly correlated).

So for a person who died in 2005, I do not wish to impute income_2006, income_2007, etc. But for someone whose income is missing during their lifetime, I would like to impute it.

All the best - and thanks for a great package!
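
For readers with the same setup, the logical matrix described above could be built roughly as follows. This is only a sketch in base R: `structural_na_mask()` is a hypothetical helper, not part of missRanger, and the column names (`death_year`, the `income_` prefix) are assumptions about how the wide data is laid out.

```r
# Hypothetical helper: TRUE marks cells that should never hold a value,
# here any income_<year> column after the person's year of death.
structural_na_mask <- function(wide, death_year, prefix = "income_") {
  mask <- matrix(FALSE, nrow(wide), ncol(wide),
                 dimnames = list(NULL, names(wide)))
  cols <- grep(paste0("^", prefix), names(wide), value = TRUE)
  for (col in cols) {
    yr <- as.integer(sub(prefix, "", col, fixed = TRUE))
    # People without a recorded death year never get a structural NA
    mask[, col] <- !is.na(death_year) & yr > death_year
  }
  mask
}

# Toy data (assumed layout): person 1 died in 2005, person 2 is alive
wide <- data.frame(
  id = 1:2,
  death_year = c(2005, NA),
  income_2005 = c(10, NA),
  income_2006 = c(NA, NA)
)
keep_na <- structural_na_mask(wide, wide$death_year)
```

Only the cell `income_2006` of person 1 is flagged; the missing incomes of person 2 remain candidates for imputation.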

@mayer79
Owner

mayer79 commented Sep 14, 2022

Thanks for pinging me. I think your situation occurs quite frequently, and you are definitely using the right data shape (wide, not long) to do the imputation.

missRanger() needs to fill all missing values during the process, because the backend ranger() cannot handle missing values. Thus, I see two solutions without touching missRanger()'s internal logic:

  1. You use that logical matrix after the imputation to set the corresponding cells again to missing. (If a missing income 2006 is filled during the process, and that value is again used to impute 2007 (and other variables), it should not hurt the statistical associations.)
  2. You split the data into two: one with persons having all missing incomes and one with the rest. Then, the two data sets are imputed separately. I don't think it is a very good approach compared to the first one.
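
A minimal sketch of the first solution, for illustration only. `impute_then_mask()` and `keep_na` are hypothetical names, not part of missRanger; the only package call assumed is `missRanger::missRanger()`, used as the default imputer. The toy run uses a stand-in mean imputer so that the snippet works without the package installed.

```r
# Impute all cells, then set the structurally missing ones back to NA.
# keep_na: logical matrix with the same dimensions as data (TRUE = keep NA).
impute_then_mask <- function(data, keep_na,
                             impute = missRanger::missRanger, ...) {
  stopifnot(is.matrix(keep_na), is.logical(keep_na),
            identical(dim(keep_na), dim(data)))
  out <- impute(data, ...)
  out[keep_na] <- NA  # logical-matrix indexing works on data.frames
  out
}

# Stand-in imputer for the toy run only (column means, numeric columns)
mean_fill <- function(d, ...) {
  d[] <- lapply(d, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  d
}

toy <- data.frame(income_2005 = c(10, 20, NA),
                  income_2006 = c(NA, 22, 31))
# Person 1 has no income in 2006 by construction
keep <- matrix(c(FALSE, FALSE, FALSE,
                 TRUE,  FALSE, FALSE), nrow = 3)
res <- impute_then_mask(toy, keep, impute = mean_fill)

# With missRanger installed, one would rather call something like:
# res <- impute_then_mask(toy, keep, pmm.k = 3, seed = 1)
```

The masked cell is still filled while the imputer runs, so it can serve as a predictor for other columns; it is only blanked out in the returned data.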

@pjlambert
Author

Thanks! I agree approach 1 is optimal.

Just to add a further tip for those who have similar use-cases:

  • I have adopted approach (1) from above
  • I have also added a dummy variable for "recently born", which takes a value of 1 in the first 3 years of life, and 0 otherwise. This then also gets widened.
  • Likewise, I add a "soon to die" variable which equals 1 for the last three years a person is alive, and 0 otherwise. This at least helps to add some life-cycle dynamics to the imputations, in a way that is not too high-dimensional.

If anyone else has faced the issue of imputing missing values in unbalanced panel data, I would love to hear more about the strategies used.
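
The two dummy variables could be created along these lines before widening. This is a rough base R sketch, not the exact code used here: `add_lifecycle_flags()` is a hypothetical helper, and the long-format column names (`id`, `year`, `birth_year`, `death_year`, `income`) are assumed for illustration.

```r
# Hypothetical helper: flag the first and last `window` years of life
add_lifecycle_flags <- function(long, window = 3) {
  age <- long$year - long$birth_year
  years_left <- long$death_year - long$year  # NA while death_year is unknown
  long$recently_born <- as.integer(age >= 0 & age < window)
  long$soon_to_die <- as.integer(
    !is.na(long$death_year) & years_left >= 0 & years_left < window
  )
  long
}

# Toy long data: one person, born 2000, died 2005
long <- data.frame(id = 1, year = 2000:2005,
                   birth_year = 2000, death_year = 2005,
                   income = c(1, 2, NA, 4, 5, 6))
flagged <- add_lifecycle_flags(long)

# Widen income and both flags: one row per person, one column per year
wide <- reshape(flagged, idvar = "id", timevar = "year",
                v.names = c("income", "recently_born", "soon_to_die"),
                direction = "wide", sep = "_")
```

Each flag adds only one extra column per year, which keeps the widened data manageable.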

@pjlambert
Author

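One quick follow-up issue is that when one "widens" the data, the donor pool for pmm (predictive mean matching) is reduced a lot, since each column then only has one row per person instead of one per person-year. This is not a big issue for my case, but something to consider.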

@mayer79 mayer79 closed this as completed Mar 24, 2023