Question #3

Closed
thierrygosselin opened this issue Dec 21, 2016 · 2 comments
Comments

@thierrygosselin
Contributor

Quick question Michael...

Scenario where more than one response variable has missing values:

e.g. with the iris dataset,
let's say Sepal.Length and Sepal.Width are missing.
We know that both of these values are correlated with each other and with Species.

Your implementation imputes column by column. Is the correlation between columns still accounted for in the model? We don't want imputed values that, taken together after imputation, don't "fit" the species...

Best,
Thierry

@mayer79
Owner

mayer79 commented Dec 22, 2016

Hello Thierry

The algorithm tries to take into account all statistical associations between all variables. So, at least in theory, the answer is yes. In practice, it generally does not work too well if, e.g., you have too little data or the values are not missing at random.
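To illustrate why imputing column by column can still respect the associations between columns, here is a minimal sketch of chained imputation in base R. It is not missRanger's implementation: `lm()` stands in for the random forests, there is no predictive mean matching, and `impute_chained` and `n_iter` are hypothetical names chosen for this example. The idea is only that each incomplete column is repeatedly re-predicted from all other columns, including the other imputed ones.

```r
# Simplified chained imputation; lm() is a stand-in for a random forest.
# `impute_chained` is a hypothetical helper, not part of missRanger.
impute_chained <- function(data, n_iter = 5) {
  miss <- lapply(data, is.na)  # remember where the gaps were
  to_fill <- names(data)[vapply(miss, any, logical(1))]

  # Crude starting values: the column mean (so numeric columns only)
  for (v in to_fill) {
    data[[v]][miss[[v]]] <- mean(data[[v]], na.rm = TRUE)
  }

  # Repeatedly re-predict each incomplete column from all other columns
  for (i in seq_len(n_iter)) {
    for (v in to_fill) {
      fit <- lm(as.formula(paste(v, "~ .")), data = data[!miss[[v]], ])
      data[[v]][miss[[v]]] <- predict(fit, newdata = data[miss[[v]], ])
    }
  }
  data
}

set.seed(1)
x <- iris
x$Sepal.Length[sample(150, 20)] <- NA
x$Sepal.Width[sample(150, 40)] <- NA
filled <- impute_chained(x)
anyNA(filled)
```

Because Species and the other measurements enter every model as predictors, the filled-in values for one sepal variable depend on the current values of the other.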

Let us see what happens to our iris data:

set.seed(398745)
# Replace some values by NA
iris2 <- iris
iris2$Sepal.Length[sample(150, 20)] <- NA
iris2$Sepal.Width[sample(150, 40)] <- NA
table(is.na(iris2$Sepal.Length), is.na(iris2$Sepal.Width))

# Output (rows: Sepal.Length is NA, columns: Sepal.Width is NA)
        FALSE TRUE
  FALSE    94   36
  TRUE     16    4

So there are 20 missing values in Sepal.Length and 40 in Sepal.Width, with 4 rows missing both.

Now let's fill those values again by running

  iris3 <- missRanger(iris2, pmm = 3, seed = 3483)

and compare the joint distribution of the two variables stratified by Species (= color) in the original data set (left) and after imputation (right).

par(mfrow = 1:2)
plot(Sepal.Length ~ Sepal.Width, data = iris, col = Species, main = "original")
plot(Sepal.Length ~ Sepal.Width, data = iris3, col = Species, main = "imputed")

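[Figure: two scatter plots of Sepal.Length against Sepal.Width, colored by Species: "original" (left) and "imputed" (right)]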

Of course, the pictures are not identical, but the structure seems to be retained.
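A numeric check can complement the visual comparison. The sketch below computes the correlation between the two sepal measurements within each species; `cor_by_species` is a hypothetical helper written for this example, not a missRanger function. It is shown on the original iris data, and the same call can be applied to the imputed data set.

```r
# Hypothetical helper: within-species correlation of the two sepal variables
cor_by_species <- function(df) {
  sapply(split(df, df$Species), function(d) cor(d$Sepal.Length, d$Sepal.Width))
}

round(cor_by_species(iris), 2)

# The same call on the imputed data from above, for comparison:
# round(cor_by_species(iris3), 2)
```

If the imputation retained the structure, the per-species correlations in the imputed data should be close to those in the original.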

@thierrygosselin
Contributor Author

Related to this, check out what this guy does for the iris dataset...
https://www.markvanderloo.eu/yaRb/2016/09/13/announcing-the-simputation-package-make-imputation-simple/
