
enhancement (something similar to randomForest::na.roughfix) #4

Closed
thierrygosselin opened this issue Jan 7, 2017 · 2 comments

Comments

@thierrygosselin
Contributor

Hi Michael,

I don't know who your target audience/users are, but currently in missRanger::missRanger (line 67) you're using

fit <- ranger::ranger(stats::reformulate(completed, response = v), ...)

where completed requires columns to have no missing data. For some datasets (e.g. genomic data), this behaviour can introduce a large bias by drastically reducing the number of variables available for training.

I suggest an enhancement similar to randomForest::na.roughfix, where an additional argument would give the user the option of quickly filling in missing values in the predictor/training-set columns.
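For reference, na.roughfix does exactly this kind of quick fill: numeric columns get their column median, factors get their most frequent level. A minimal illustration (the toy data frame below is made up):

```r
library(randomForest)

# Toy data with gaps in a numeric and a factor column (made-up example)
df <- data.frame(
  num = c(1, NA, 3, 4),
  cat = factor(c("a", "a", NA, "b"))
)

filled <- randomForest::na.roughfix(df)
filled$num  # NA replaced by the median of the observed values (here 3)
filled$cat  # NA replaced by the most frequent level (here "a")
```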

Also discussed here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

Missing value replacement for the training set
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
If x(m,n) is a missing continuous value, estimate its fill as an average over the non-missing values of the mth variables weighted by the proximities between the nth case and the non-missing value case. If it is a missing categorical variable, replace it by the most frequent non-missing value where frequency is weighted by proximity.
Now iterate-construct a forest again using these newly filled in values, find new fills and iterate again. Our experience is that 4-6 iterations are enough.
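The proximity-weighted fill for a continuous variable described above can be sketched in a few lines of R. This is a hypothetical helper, not part of any package; `prox` stands for the n x n proximity matrix obtained from a forest run, and in the full procedure one would refit the forest and recompute `prox` between iterations:

```r
# Sketch of Breiman's proximity-weighted fill for one continuous
# variable x, given a proximity matrix prox (n x n).
proximity_fill <- function(x, prox) {
  miss <- which(is.na(x))
  obs  <- which(!is.na(x))
  for (i in miss) {
    w <- prox[i, obs]                  # proximities to cases with observed x
    x[i] <- sum(w * x[obs]) / sum(w)   # weighted average of observed values
  }
  x
}

# With equal proximities, the fill reduces to the mean of the observed values:
proximity_fill(c(1, NA, 3), matrix(1, 3, 3))  # 1 2 3
```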

Cheers
Thierry

@thierrygosselin
Contributor Author

Some additional thoughts on this...

I don't see anywhere in ranger's code how missing values in predictors are handled; I guess they aren't. But for the genomic example using GWAS data, they say (line 70 in ranger::ranger):

Note that missing values are treated as an extra category while splitting

That might be an adequate solution?
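In base R, that "extra category" encoding can be made explicit with addNA(), which turns NA into its own factor level before fitting:

```r
f <- factor(c("AA", "AB", NA, "BB"))
nlevels(f)        # 3 levels; NA is not one of them

f2 <- addNA(f)    # NA becomes an explicit level
nlevels(f2)       # 4 levels
```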

@mayer79
Owner

mayer79 commented Jan 7, 2017

Thanks for your comments, which are always very welcome. The ranger developers are still evaluating the best way to allow for missing values in predictors; I have been waiting for this for quite a while ;). No need to worry about the completed columns in missRanger, though: the vector might be empty at the beginning of the iterative procedure and is then built up step by step during the first iteration.

Let me demonstrate with a data set without any complete row:

```r
# input
mydat <- data.frame(x = c(NA, NA, 1), y = c(NA, 2, NA))
mydat
missRanger(mydat)
```

```
# output
  x y
1 1 2
2 1 2
3 1 2
```


Personally, I usually use missRanger after logical imputations. For instance, if I have a column containing only "x" and a lot of NA (as we typically have with tickbox data), then I manually replace the NA by "Not ticked" (or just use a dummy that is 1 if "x" and 0 otherwise). In certain applications it makes sense to replace all missing values in categorical variables by a new category like "none", but not always.

At the moment I am evaluating different ways to further develop missRanger. One idea would be to add an option minPropForNone = 1, which would replace the missing values in categorical factors with a missing proportion above minPropForNone by "none". In a next step, we could add similar rules for highly discrete numeric columns.
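The tickbox pattern described above can be spelled out in a couple of lines (the column and its values are made up for illustration):

```r
# Made-up tickbox column: "x" means ticked, NA means not ticked
tick <- c("x", NA, "x", NA, NA)

# Option 1: replace NA by an explicit category
tick_cat <- factor(ifelse(is.na(tick), "Not ticked", "Ticked"))

# Option 2: a 0/1 dummy (1 if ticked, 0 otherwise)
tick_dummy <- as.integer(!is.na(tick))
tick_dummy  # 1 0 1 0 0
```

Either way, the NA carries real information ("not ticked") rather than being something to impute.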

@mayer79 mayer79 closed this as completed Mar 2, 2017