
enhancement (something similar to randomForest::na.roughfix) #4

Closed
thierrygosselin opened this issue Jan 7, 2017 · 2 comments

Comments

@thierrygosselin
Contributor

Hi Michael,

I don't know who your target audience/users are, but currently in missRanger::missRanger (line 67) you're using

fit <- ranger::ranger(stats::reformulate(completed, response = v), ...)

where completed requires columns to have no missing data. For some datasets (e.g. genomic data), this behaviour can introduce a large bias by drastically reducing the number of variables available for training.

I suggest an enhancement similar to randomForest::na.roughfix, where an additional argument would give the user the option of quickly filling in missing values in the predictor/training-set columns.
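For reference, na.roughfix does exactly this kind of quick fill: numeric columns get their column median, factors get their most frequent level. A minimal illustration (the toy data frame below is made up):

```r
library(randomForest)

# Toy data with gaps in a numeric and a factor column (made-up example)
df <- data.frame(
  num = c(1, NA, 3, 4),
  cat = factor(c("a", "a", NA, "b"))
)

filled <- randomForest::na.roughfix(df)
filled$num  # NA replaced by the median of the observed values (here 3)
filled$cat  # NA replaced by the most frequent level (here "a")
```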

Also discussed here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

Missing value replacement for the training set
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
If x(m,n) is a missing continuous value, estimate its fill as an average over the non-missing values of the mth variables weighted by the proximities between the nth case and the non-missing value case. If it is a missing categorical variable, replace it by the most frequent non-missing value where frequency is weighted by proximity.
Now iterate-construct a forest again using these newly filled in values, find new fills and iterate again. Our experience is that 4-6 iterations are enough.
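The proximity-weighted fill for a continuous variable described above can be sketched in a few lines of R. This is a hypothetical helper, not part of any package; `prox` stands for the n x n proximity matrix obtained from a forest run, and in the full procedure one would refit the forest and recompute `prox` between iterations:

```r
# Sketch of Breiman's proximity-weighted fill for one continuous
# variable x, given a proximity matrix prox (n x n).
proximity_fill <- function(x, prox) {
  miss <- which(is.na(x))
  obs  <- which(!is.na(x))
  for (i in miss) {
    w <- prox[i, obs]                  # proximities to cases with observed x
    x[i] <- sum(w * x[obs]) / sum(w)   # weighted average of observed values
  }
  x
}

# With equal proximities, the fill reduces to the mean of the observed values:
proximity_fill(c(1, NA, 3), matrix(1, 3, 3))  # 1 2 3
```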

Cheers
Thierry

@thierrygosselin
Contributor Author

Some additional thoughts on this...

I don't see anywhere in ranger's code how missing values in predictors are handled; I guess they aren't. But for the genomic example using GWAS data, they say (line 70 in ranger::ranger):

Note that missing values are treated as an extra category while splitting

That might be an adequate solution?
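In base R, that "extra category" encoding can be made explicit with addNA(), which turns NA into its own factor level before fitting:

```r
f <- factor(c("AA", "AB", NA, "BB"))
nlevels(f)        # 3 levels; NA is not one of them

f2 <- addNA(f)    # NA becomes an explicit level
nlevels(f2)       # 4 levels
```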

@mayer79
Owner

mayer79 commented Jan 7, 2017

Thanks for your comments, which are always very welcome. The ranger developers are still evaluating the best way to allow for missing values in predictors; I have been waiting for this for quite a while ;). No need to worry about the completed columns in missRanger, though: the vector might be empty at the beginning of the iterative procedure and is then built up step by step during the first iteration.

Let me demonstrate with a data set without any complete row:

```r
# input
mydat <- data.frame(x = c(NA, NA, 1), y = c(NA, 2, NA))
mydat
missRanger(mydat)
```

```
# output
  x y
1 1 2
2 1 2
3 1 2
```


Personally, I usually use missRanger after logical imputations. For instance, if I have a column containing only "x" and a lot of NA (as we typically have with tickbox data), then I manually replace the NA by "Not ticked" (or just use a dummy that is 1 if "x" and 0 otherwise). In certain applications it makes sense to replace all missing values in categorical variables by a new category like "none", but not always.

At the moment I am evaluating different ways to further develop missRanger. One idea would be to add an option minPropForNone = 1, which would replace the missing values in categorical factors with a missing proportion above minPropForNone by "none". In a next step, we could add similar rules for highly discrete numeric columns.
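The tickbox pattern described above can be spelled out in a couple of lines (the column and its values are made up for illustration):

```r
# Made-up tickbox column: "x" means ticked, NA means not ticked
tick <- c("x", NA, "x", NA, NA)

# Option 1: replace NA by an explicit category
tick_cat <- factor(ifelse(is.na(tick), "Not ticked", "Ticked"))

# Option 2: a 0/1 dummy (1 if ticked, 0 otherwise)
tick_dummy <- as.integer(!is.na(tick))
tick_dummy  # 1 0 1 0 0
```

Either way, the NA carries real information ("not ticked") rather than being something to impute.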

@mayer79 mayer79 closed this as completed Mar 2, 2017