
Question on PMM #7

Closed
thierrygosselin opened this issue Oct 31, 2017 · 4 comments

Comments

@thierrygosselin (Contributor) commented Oct 31, 2017

Hi Michael, my question is regarding PMM.

e.g. a dataset with 10,000 variables with different levels of missingness.
Is there a potential for bias if PMM is carried out after the model for one variable?

Since the kNN will only be between that variable's values and will not account for all the variables.
If all variables were accounted for in the distance, the neighbours would be different, I suppose...

And, not related to missRanger: I see you've started working with LightGBM. Have you tried imputation with it?

Best
Thierry

@mayer79 (Owner) commented Oct 31, 2017

Hello Thierry
Great to hear from you. How is your package developing?

On LightGBM: I don't think that gradient boosters are very helpful here. The reason is that they have about ten parameters that have to be chosen carefully, unlike random forests, which usually work acceptably well even without any tuning. But for other purposes, LightGBM is ingenious.

On PMM: I am not sure if I got your question. The idea is as follows: with k-PMM, the missing value of variable x in observation i is not directly filled in by the OOB prediction of the random forest. Instead, the OOB prediction of observation i is compared with the OOB predictions of all observations without a missing x. Among the k nearest OOB predictions, one observation is picked at random, and its x value is used for imputation. Since the OOB predictions are a function of all variables (except x), the match is actually done implicitly on all variables (except x), not unlike propensity score matching.

Let me give you an example:

library(missRanger)
crazyData <- data.frame(x1 = c(NA, 1, 1, 1, 1, 2, 2, 2, 2), x2 = c(1, 2, 3, 2, 3, 5, 6, 5, 6))
filledData <- missRanger(crazyData, pmm.k = 1)
filledData

#   x1 x2
# 1  1  1
# 2  1  2
# 3  1  3
# 4  1  2
# 5  1  3
# 6  2  5
# 7  2  6
# 8  2  5
# 9  2  6

The first observation's x2 value is close to the x2 values of observations 2-4. Thus, their OOB predictions for x1 will be quite close. Consequently, the first value is imputed with one of their x1 values (which is 1).

@thierrygosselin (Contributor)

Hi Michael, thanks for the quick reply, very helpful!
For the package, we are now working on simulated genomic data to test different imputation methods.
So far I want to test 4-5 methods, including missRanger and the on-the-fly imputation proposed in randomForestSRC.
I was thinking of integrating your PMM approach after LightGBM and XGBoost... but as you said, there are numerous arguments to tune, and it's not as simple as the RF approach...

Best
Thierry

@mayer79 (Owner) commented Oct 31, 2017

Wow - I am very much looking forward to seeing the results! My PMM code is actually quite hard to read, but only because I wanted to be able to deal with categorical variables. For purely numeric data, it is much simpler.
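
For the purely numeric case, the k-PMM idea described above can be sketched in a few lines. This is an illustrative simplification, not missRanger's actual implementation (in particular, `pmm_numeric` and its arguments are made-up names for this sketch):

```r
# A simplified k-PMM for purely numeric data (illustration only):
# ytrain:     observed values of the variable being imputed
# pred_train: (OOB) predictions for the observations with observed values
# pred_test:  predictions for the observations with missing values
pmm_numeric <- function(ytrain, pred_train, pred_test, k = 1) {
  vapply(pred_test, function(p) {
    nn <- order(abs(pred_train - p))[seq_len(k)]  # k nearest predictions
    ytrain[nn[sample.int(k, 1)]]                  # donate one neighbour's observed value
  }, numeric(1))
}

# The test prediction 1.1 is closest to the training predictions 1.0 and 1.2,
# so with k = 2 the imputed value is drawn from their observed values (10 or 20).
pmm_numeric(ytrain = c(10, 20, 30), pred_train = c(1.0, 1.2, 5.0),
            pred_test = 1.1, k = 2)
```

The matching happens entirely in prediction space, which is why all covariates enter the match implicitly through the model.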

@thierrygosselin (Contributor)

It's fine, and the data are categorical.

The problem faced with imputation and RADseq data is with low-frequency genotypes.

The problem I first faced with XGBoost and LightGBM is that I have to create a training/test set by first splitting on my non-missing genotypes; then I do the imputations.

In missRanger::pmm, the xtrain argument requires running the prediction model back on all the data (training + test sets).

The xtest argument is the imputed data generated by the model prediction.
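
If I understand the interface correctly, an external model's predictions can be wired into missRanger::pmm roughly as follows. This is a hypothetical sketch with made-up data; the argument names xtrain/xtest/ytrain come from the thread above, but check the exact signature in your installed version of the package:

```r
library(missRanger)

# Hypothetical wiring of an external model's predictions into missRanger::pmm
# (data made up for illustration):
pred_obs  <- c(1.0, 1.2, 5.0)  # model predictions on rows with observed genotypes
pred_miss <- c(1.05, 4.9)      # model predictions on rows with missing genotypes
y_obs     <- c(10, 20, 30)     # the observed values themselves

# With k = 1, each missing value receives the observed value belonging to the
# single closest prediction:
pmm(xtrain = pred_obs, xtest = pred_miss, ytrain = y_obs, k = 1)
```

The same wiring should work regardless of whether the predictions come from ranger, XGBoost, or LightGBM, since pmm only sees the prediction vectors.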

The couple of tests conducted show an increase in variance. More low-frequency genotypes are reintroduced in the imputed data, which is good; otherwise, they were dropped by the model.
