
question on missRanger #18

Closed

BenoitLondon opened this issue Jul 18, 2019 · 3 comments

BenoitLondon commented Jul 18, 2019

Hi,
Very nice package! I love the ranger package as well, so it's a very good idea to use it for imputation!
I actually have 4 questions:

  1. I was just wondering if it's possible to impute data based on some subset of rows, as in a cross-validation setting or, in my case, in a time-dynamic setup.

E.g. I have time-series data and I would like to impute values using only past data; is there a better way than calling missRanger repeatedly at each time point, subsetting on the past? (See the sketch after this list.)

  2. Even in that case, should I impute the next day using the raw past data or the previously completed data?

  3. Similarly, if I have a train and a test dataset, is there a way to apply the imputation rules learned on the train dataset to the test one without rerunning the algorithm? (Can missRanger return the imputation model?)

  4. Do you have recommendations about when to use the extratrees splitrule? Is it better, and if so, in which cases?
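For concreteness, here is a minimal sketch of the repeated-call approach from question 1, assuming the data has a complete `time` column and that each time point is imputed from an expanding window of past data. All names here (`impute_expanding`, `time`, `min_obs`) are illustrative, not part of missRanger:

```r
library(missRanger)

# Hypothetical expanding-window imputation: impute rows at time t using
# only data observed up to t. `time` is assumed complete; min_obs guards
# against fitting forests on too little history.
impute_expanding <- function(df, min_obs = 50) {
  out <- df
  for (t in sort(unique(df$time))) {
    past <- df$time <= t
    if (sum(past) < min_obs) next
    completed <- missRanger(df[past, ], verbose = 0)
    # keep only the imputations for the current time point
    out[df$time == t, ] <- completed[completed$time == t, ]
  }
  out
}
```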

Thanks!

mayer79 (Owner) commented Jul 18, 2019

Hi Benoit

Good input, thanks.

  1. It would be a great feature to have, but unfortunately it is not implemented yet, for two reasons. Firstly, random forests are huge to store in memory or on disk, so $m$ random forests (one per incomplete variable) are even larger. Secondly, if there is just one variable with missing values, such "out-of-sample" application works in a straightforward way. But what if you want to impute a new observation with more than one missing value? One possible algorithm would be to first fill all missings with a default value and then apply the fitted random forests iteratively until the imputations stabilize (see the sketch below). Any good ideas here?

Due to this, I don't think I can give a positive answer to 1) and 2) yet.
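A rough sketch of that iterative idea, using plain ranger rather than the missRanger API (all-numeric data assumed for simplicity, and every name here is an illustrative assumption). Each `forests[[v]]` would be a forest fitted on the completed training data, e.g. `ranger(reformulate(setdiff(names(train), v), v), data = train)`:

```r
library(ranger)

# Illustrative only: apply per-variable forests (fitted on completed
# training data) to new rows. Missing cells start at a default value
# and are re-predicted in turns until the imputations stop changing.
impute_new <- function(newdata, forests, defaults, max_iter = 10, tol = 1e-4) {
  miss <- is.na(newdata)
  for (v in names(defaults)) {
    newdata[[v]][miss[, v]] <- defaults[[v]]  # e.g. training medians
  }
  for (i in seq_len(max_iter)) {
    old <- newdata
    for (v in names(forests)) {
      if (!any(miss[, v])) next
      p <- predict(forests[[v]], data = newdata[miss[, v], , drop = FALSE])
      newdata[[v]][miss[, v]] <- p$predictions
    }
    if (max(abs(as.matrix(newdata) - as.matrix(old))) < tol) break  # stabilized
  }
  newdata
}
```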

  2. Extra trees is less greedy (it uses random split points instead of optimal ones) and thus also faster than a random forest. In my experience, random forests are usually more accurate except in high-dimensional cases where extra variability could be a plus.
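If you want to try it, a minimal sketch, assuming ranger arguments pass through missRanger's `...` as in the current versions of both packages:

```r
library(missRanger)

# Sketch: request extra-trees splitting in the underlying ranger forests.
# splitrule and num.random.splits are ranger arguments forwarded via `...`.
imp <- missRanger(df, splitrule = "extratrees", num.random.splits = 1)
```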

BenoitLondon (Author)

Thanks for your answers!


saudiwin commented Jan 2, 2020

Hi @BenoitLondon ,

Jumping in late here, but it doesn't seem to me that you need to worry about past/present values with random forests: it's essentially a non-parametric technique, so it will capture time dependence by approximating it as an unknown function. To make sure it picks up time features, I think you can simply include a time counter/index, as in the sketch below.
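For example (illustrative only, and assuming the rows are already in time order):

```r
library(missRanger)

# Give the forests an explicit time feature before imputing.
df$time_idx <- seq_len(nrow(df))  # simple counter; date-derived features work too
imp <- missRanger(df, pmm.k = 3, verbose = 0)
```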

mayer79 closed this as completed Jan 24, 2021