Allow out-of-sample application #58

Open
jeandigitale opened this issue Nov 29, 2023 · 1 comment
@jeandigitale

Hi,

I'm excited about the new keep_forests option. I was hoping to use it to train imputation forests on a training set and then use those models to impute on a test set. However, when I try, I get an error that I am missing data in other columns in the test set and therefore can't predict out the imputations for a given variable. Is there any way around that?

Note that I think in the documentation it says: "Only relevant when data_only = TRUE (and when forests are grown)." I think you meant FALSE.

Thanks!

@mayer79
Owner

mayer79 commented Nov 29, 2023

Good catch, thanks, also for the typo.

Having a predict() function would be very neat. As far as I know, there is no "official" way to do so. Here is a sketch:

  1. First, fill the missing values in the new data by sampling randomly from the non-missing values of the original "training" data used to fit missRanger().
  2. Then apply the predictions of the stored forests iteratively, say, three times.
library(missRanger)

irisWithNA <- generateNA(iris, seed = 34)

.in <- c(1:40, 51:90, 101:140)
data_train <- irisWithNA[.in, ]

imp <- missRanger(
  irisWithNA[.in, ], pmm.k = 3, num.trees = 100, data_only = FALSE, keep_forests = TRUE
)

newdata <- irisWithNA[-.in, ]

# data_train is the original unimputed dataset used to fit missRanger(). 
# Will add it to the "missRanger" object later to simplify the API
predict.missRanger <- function(x, newdata, data_train, n_iter = 3, pmm.k = 5) {
  to_fill <- is.na(newdata[x$visit_seq])
  to_fill_train <- is.na(data_train)

  # Initialize by randomly picking from original data
  for (v in x$visit_seq) {
    m <- sum(to_fill[, v])
    newdata[[v]][to_fill[, v]] <- sample(
      data_train[[v]][!to_fill_train[, v]], size = m, replace = TRUE
    )
  }
  
  for (i in seq_len(n_iter)) {
    for (v in x$visit_seq) {
      v_na <- to_fill[, v]
      if (!any(v_na)) next  # nothing to impute for this variable in newdata
      pred <- predict(x$forests[[v]], newdata[v_na, ])$predictions
      if (pmm.k > 0) {
        pred <- pmm(
          xtrain = x$forests[[v]]$predictions, 
          xtest = pred, 
          ytrain = data_train[[v]][!is.na(data_train[[v]])], 
          k = pmm.k
        )
      }
      newdata[v_na, v] <- pred
    }
  }
  newdata
}

out <- predict(imp, newdata, data_train = data_train)
head(out)
head(iris[-.in, ])  # true values of the held-out rows, for comparison

# Did not change existing values? (Should be TRUE)
all(out[!is.na(newdata)] == newdata[!is.na(newdata)])

# Any missings left? Should be FALSE
anyNA(out)
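
Step 1 of the sketch above (random initialization from the training data) can also be looked at in isolation. The following is a minimal base-R illustration on toy data, not part of missRanger: the helper name `init_from_train` and the toy data frames are made up for this example. It samples donor indices rather than donor values, which is an assumption-free way to avoid the special case where `sample()` is given a single number and treats it as `1:n`.

```r
# Hypothetical helper (not part of missRanger): fill the NAs in `newdata`
# column by column with values sampled from the observed (non-missing)
# values of the same column in `data_train`.
init_from_train <- function(newdata, data_train, cols = names(newdata)) {
  for (v in cols) {
    miss <- is.na(newdata[[v]])
    if (!any(miss)) next  # nothing to fill in this column
    donors <- data_train[[v]][!is.na(data_train[[v]])]
    if (length(donors) == 0L) {
      stop("No observed values in training column: ", v)
    }
    # Sample positions instead of values, so that a single numeric donor
    # is not interpreted by sample() as the range 1:donor
    idx <- sample.int(length(donors), size = sum(miss), replace = TRUE)
    newdata[[v]][miss] <- donors[idx]
  }
  newdata
}

# Toy data, made up for illustration
train <- data.frame(
  x = c(1, 2, NA, 4),
  g = factor(c("a", NA, "b", "b"))
)
test <- data.frame(
  x = c(NA, 10, NA),
  g = factor(c("a", NA, NA), levels = c("a", "b"))
)

set.seed(1)
filled <- init_from_train(test, train)
filled
```

Observed entries of `test` are left untouched, and every filled entry is a value that actually occurs in `train`. The iterative forest predictions of step 2 would then start from `filled`.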

@mayer79 mayer79 changed the title keep_forests Allow out-of-sample application Nov 30, 2023
@mayer79 mayer79 self-assigned this Nov 30, 2023