Allow out-of-sample application #58
Good catch, thanks, also for the typo. Having a
library(missRanger)
irisWithNA <- generateNA(iris, seed = 34)
.in <- c(1:40, 51:90, 101:140)
data_train <- irisWithNA[.in, ]
imp <- missRanger(
  irisWithNA[.in, ], pmm.k = 3, num.trees = 100, data_only = FALSE, keep_forests = TRUE
)
newdata <- irisWithNA[-.in, ]
# data_train is the original unimputed dataset used to fit missRanger().
# Will add it to the "missRanger" object later to simplify the API
predict.missRanger <- function(x, newdata, data_train, n_iter = 3, pmm.k = 5) {
  to_fill <- is.na(newdata[x$visit_seq])
  to_fill_train <- is.na(data_train)

  # Initialize by randomly picking from original data
  for (v in x$visit_seq) {
    m <- sum(to_fill[, v])
    newdata[[v]][to_fill[, v]] <- sample(
      data_train[[v]][!to_fill_train[, v]], size = m, replace = TRUE
    )
  }

  # Repeatedly re-predict the originally missing cells with the stored forests
  for (i in seq_len(n_iter)) {
    for (v in x$visit_seq) {
      v_na <- to_fill[, v]
      if (!any(v_na)) {
        next  # nothing to impute for this variable in newdata
      }
      pred <- predict(x$forests[[v]], newdata[v_na, ])$predictions
      if (pmm.k > 0) {
        # Match predictions to observed training values of v
        pred <- pmm(
          xtrain = x$forests[[v]]$predictions,
          xtest = pred,
          ytrain = data_train[[v]][!is.na(data_train[[v]])],
          k = pmm.k
        )
      }
      newdata[v_na, v] <- pred
    }
  }
  newdata
}
out <- predict(imp, newdata, data_train = data_train)
head(out)
head(iris[-.in, ])
# Did not change existing values? (Should be TRUE)
all(out[!is.na(newdata)] == newdata[!is.na(newdata)])
# Any missings left? Should be FALSE
anyNA(out)
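For readers unfamiliar with the `pmm()` step in the draft above: predictive mean matching replaces each raw prediction with an observed training value whose own prediction is close to it, so imputed values always come from values actually seen in the data. Below is a rough base-R sketch of that idea for a numeric variable. `pmm_sketch()` is a hypothetical helper written for illustration only; it is not missRanger's implementation and ignores factors and tie handling.

```r
# Hypothetical helper (not missRanger's pmm()): a minimal sketch of
# predictive mean matching for a numeric variable.
pmm_sketch <- function(xtrain, xtest, ytrain, k = 3) {
  stopifnot(length(xtrain) == length(ytrain), k >= 1)
  k <- min(k, length(xtrain))
  vapply(xtest, function(p) {
    # Indices of the k training predictions closest to the new prediction p
    nearest <- order(abs(xtrain - p))[seq_len(k)]
    # Draw one of the corresponding observed training values at random
    ytrain[nearest[sample.int(k, 1)]]
  }, FUN.VALUE = numeric(1))
}

set.seed(1)
xtrain <- c(1.0, 2.0, 3.0, 10.0)  # predictions on the training rows
ytrain <- c(1.1, 1.9, 3.2, 9.7)   # observed values on the same rows
res <- pmm_sketch(xtrain, xtest = c(2.1, 9.0), ytrain = ytrain, k = 2)
res
```

Every element of `res` is one of the observed `ytrain` values, which is why the draft passes the training data's non-missing values as `ytrain`.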
Hi,
I'm excited about the new keep_forests option. I was hoping to use it to train imputation forests on a training set and then use those models to impute on a test set. However, when I try, I get an error that I am missing data in other columns in the test set and therefore can't predict out the imputations for a given variable. Is there any way around that?
Note that I think in the documentation it says: "Only relevant when data_only = TRUE (and when forests are grown)." I think you meant FALSE.
Thanks!