Question on PMM #7
Hello Thierry,

On LightGBM: I don't think that gradient boosters are very helpful here. The reason is that they have about ten parameters that have to be chosen carefully, unlike random forests, which usually work acceptably well even without any tuning. But for other purposes, LightGBM is ingenious.

On PMM: I am not sure I got your question. The idea is as follows: with k-PMM, the missing value in variable x and observation i is not directly filled in by the OOB prediction of the random forest. Instead, the OOB prediction of observation i is compared with the OOB predictions of all observations without missing x. Among the k nearest OOB predictions, one observation is picked at random, and its observed x value is used for imputation. Since the OOB predictions are a function of all variables (except x), the match is actually done implicitly on all variables (except x), not unlike propensity score matching. Let me give you an example:
The first observation's x2 value is close to the x2 values of observations 2-4. Thus, their OOB predictions for x1 will be quite close. Consequently, the first observation's missing x1 is imputed with one of their x1 values (which is 1).
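The matching step described above can be sketched roughly as follows. This is an illustrative Python sketch, not missRanger's actual R code; the function name `pmm_impute` and its signature are hypothetical. Because it simply copies a donor's observed value, the same logic also works when x is categorical:

```python
import numpy as np

def pmm_impute(pred_missing, pred_donors, x_donors, k=3, rng=None):
    """k-PMM sketch: for each prediction of a missing value, find the k
    donors whose OOB predictions are closest, pick one at random, and
    return that donor's observed x value."""
    rng = np.random.default_rng(rng)
    imputed = np.empty(len(pred_missing), dtype=x_donors.dtype)
    for j, p in enumerate(pred_missing):
        # distance is measured in prediction space, not raw covariate space
        dist = np.abs(pred_donors - p)
        nearest = np.argsort(dist)[:k]        # indices of the k closest donors
        imputed[j] = x_donors[rng.choice(nearest)]
    return imputed
```

Mirroring the example above: if observations 2-4 have OOB predictions close to that of observation 1, and all of them have x1 = 1, then observation 1 is imputed with 1 regardless of which of the three donors is drawn.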
Hi Michael, thanks for the quick reply, very helpful! Best
Wow - I am very much looking forward to seeing the results! My PMM code is actually very hard to read, but only because I wanted to be able to deal with categorical variables. For purely numeric data, it is actually much simpler.
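For purely numeric data, the whole matching step can indeed be vectorized. A minimal sketch, again with a hypothetical function name and assuming NumPy:

```python
import numpy as np

def pmm_numeric(pred_missing, pred_donors, x_donors, k=3, rng=None):
    """Vectorized k-PMM sketch for numeric x: compute all pairwise
    prediction distances at once, then draw one of the k nearest donors
    per missing value."""
    rng = np.random.default_rng(rng)
    # (n_missing, n_donors) distance matrix in prediction space
    dist = np.abs(pred_missing[:, None] - pred_donors[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]          # k closest donors per row
    pick = rng.integers(0, k, size=len(pred_missing))  # random donor among the k
    return x_donors[nearest[np.arange(len(pred_missing)), pick]]
```

The categorical case is harder mainly because "distance between predictions" has to be defined per class (e.g. on predicted probabilities), which is why the general code is messier.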
It's fine, and the data is categorical. The problem with imputation and RADseq data is low-frequency genotypes. The issue I first faced with XGBoost or LightGBM is that I have to build a training/test set by splitting my non-missing genotypes before doing the imputations. With missRanger::pmm, the couple of tests conducted show an increase in variance: more low-frequency genotypes are reintroduced in the imputed data, which is good. Otherwise, they were dropped by the model.
Hi Michael, my question is regarding PMM.
e.g. a dataset with 10,000 variables with different levels of missingness.
Is there a potential for bias if PMM is carried out after the model for one variable?
Since the kNN will only be between that variable's values, not accounting for all the variables.
If all variables were accounted for in the distance, the neighbours would be different, I suppose...
And, not related to missRanger: I see you've started working with LightGBM, have you tried imputations with it?
Best
Thierry