
question on missRanger #18

Closed

BenoitLondon opened this issue Jul 18, 2019 · 3 comments

BenoitLondon commented Jul 18, 2019

Hi,
Very nice package! I love the ranger package as well, so it's a very good idea to use it for imputation!
I actually have 4 questions:

  1. I was just wondering if it's possible to impute data based on some subset of rows, as in a cross-validation setting or, in my case, in a time-dynamic setup.

E.g. I have time-series data and I would like to impute values using only past data; is there a better way than calling missRanger repeatedly at each time point, subsetting on the past? (See the sketch after this list.)

  2. Even in that case, should I impute the next day using the raw past data or the previously completed data?

  3. Similarly, if I have a train and a test dataset, is there a way to apply the imputation rules learned on the train dataset to the test one without rerunning the algorithm? (Can missRanger return the imputation model?)

  4. Do you have recommendations about when to use the extratrees splitrule? Is it better, and if so, in which cases?
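For concreteness, here is a minimal sketch of the repeated-call approach from question 1, assuming the data has a complete `time` column and that each time point is imputed from an expanding window of past data. All names here (`impute_expanding`, `time`, `min_obs`) are illustrative, not part of missRanger:

```r
library(missRanger)

# Hypothetical expanding-window imputation: impute rows at time t using
# only data observed up to t. `time` is assumed complete; min_obs guards
# against fitting forests on too little history.
impute_expanding <- function(df, min_obs = 50) {
  out <- df
  for (t in sort(unique(df$time))) {
    past <- df$time <= t
    if (sum(past) < min_obs) next
    completed <- missRanger(df[past, ], verbose = 0)
    # keep only the imputations for the current time point
    out[df$time == t, ] <- completed[completed$time == t, ]
  }
  out
}
```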

Thanks!

mayer79 (Owner) commented Jul 18, 2019

Hi Benoit

Good input, thanks.

  1. It would be a great feature to have, but unfortunately it is not implemented yet, for two reasons. Firstly, random forests are huge to store in memory or on disk, so $m$ random forests (one per incomplete variable) are even larger. Secondly, if there is just one variable with missing values, such "out-of-sample" application works in a straightforward way. But what if you want to impute a new observation with more than one missing value? One possible algorithm would be to first fill all missings with a default value and then apply the fitted random forests iteratively until the imputations stabilize (see the sketch below). Any good ideas here?

Due to this, I don't think I can give a positive answer to 1) and 2) yet.
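A rough sketch of that iterative idea, using plain ranger rather than the missRanger API (all-numeric data assumed for simplicity, and every name here is an illustrative assumption). Each `forests[[v]]` would be a forest fitted on the completed training data, e.g. `ranger(reformulate(setdiff(names(train), v), v), data = train)`:

```r
library(ranger)

# Illustrative only: apply per-variable forests (fitted on completed
# training data) to new rows. Missing cells start at a default value
# and are re-predicted in turns until the imputations stop changing.
impute_new <- function(newdata, forests, defaults, max_iter = 10, tol = 1e-4) {
  miss <- is.na(newdata)
  for (v in names(defaults)) {
    newdata[[v]][miss[, v]] <- defaults[[v]]  # e.g. training medians
  }
  for (i in seq_len(max_iter)) {
    old <- newdata
    for (v in names(forests)) {
      if (!any(miss[, v])) next
      p <- predict(forests[[v]], data = newdata[miss[, v], , drop = FALSE])
      newdata[[v]][miss[, v]] <- p$predictions
    }
    if (max(abs(as.matrix(newdata) - as.matrix(old))) < tol) break  # stabilized
  }
  newdata
}
```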

  2. Extra trees is less greedy (it uses random split points instead of optimal ones) and thus also faster than a random forest. In my experience, random forests are usually more accurate except in high-dimensional cases where extra variability could be a plus.
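If you want to try it, a minimal sketch, assuming ranger arguments pass through missRanger's `...` as in the current versions of both packages:

```r
library(missRanger)

# Sketch: request extra-trees splitting in the underlying ranger forests.
# splitrule and num.random.splits are ranger arguments forwarded via `...`.
imp <- missRanger(df, splitrule = "extratrees", num.random.splits = 1)
```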

BenoitLondon (Author)

Thanks for your answers!


saudiwin commented Jan 2, 2020

Hi @BenoitLondon ,

Jumping in late here, but it doesn't seem to me that you need to worry about past/present values with random forests: it's essentially a non-parametric technique, so it will capture time dependence by approximating it as an unknown function. To make sure it picks up time features, I think you can simply include a time counter/index, as in the sketch below.
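For example (illustrative only, and assuming the rows are already in time order):

```r
library(missRanger)

# Give the forests an explicit time feature before imputing.
df$time_idx <- seq_len(nrow(df))  # simple counter; date-derived features work too
imp <- missRanger(df, pmm.k = 3, verbose = 0)
```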

mayer79 closed this as completed Jan 24, 2021