Merge pull request #49 from mayer79/structuring_vignette
Add headers to main vignette
mayer79 committed May 26, 2023
2 parents 7059856 + 326ef31 commit 9453a1e
Showing 1 changed file with 12 additions and 6 deletions.
18 changes: 12 additions & 6 deletions vignettes/missRanger.Rmd
@@ -71,14 +71,18 @@ irisImputed <- missRanger(irisWithNA, num.trees = 100, verbose = 0)
head(irisImputed)
```

+### Predictive mean matching
+
It worked! Unfortunately, the new values look somewhat unnatural because they are not rounded like the observed values. To avoid this, set the `pmm.k` argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:

``` {r}
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, verbose = 0)
head(irisImputed)
```

-Note that `missRanger()` offers a `...` argument to pass options to `ranger()`, e.g. `num.trees` or `min.node.size`. How would we use its "extremely randomized trees" variant with 50 trees?
+### Controlling the random forests
+
+`missRanger()` offers a `...` argument to pass options to `ranger()`, e.g. `num.trees` or `min.node.size`. How would we use its "extremely randomized trees" variant with 50 trees?

``` {r}
irisImputed_et <- missRanger(
@@ -93,6 +97,8 @@ head(irisImputed_et)

It is as simple as that!

+### Use in Pipe
+
{missRanger} also plays well together with the pipe:

```r
@@ -102,6 +108,8 @@ iris |>
head()
```

+### Formula interface
+
By default, `missRanger()` uses all columns in the data set to impute all columns with missing values. To override this behaviour, you can use an intuitive formula interface: the left-hand side specifies the variables to be imputed (variable names separated by a `+`), while the right-hand side lists the variables used for imputation.
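
To make the two sides of such a formula concrete, here is a small base-R sketch that only inspects a formula object. It does not call `missRanger()` and is not necessarily how the package parses formulas internally; the formula itself is an illustrative choice built from `iris` column names.

```r
# Illustrative formula (an assumption for this sketch):
# impute Sepal.Width and Species, using Species and Petal.Length
f <- Sepal.Width + Species ~ Species + Petal.Length

to_impute <- all.vars(f[[2]])  # variables on the left-hand side
impute_by <- all.vars(f[[3]])  # variables on the right-hand side

to_impute
impute_by
```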

``` {r}
@@ -138,7 +146,7 @@ m <- missRanger(irisWithNA, . ~ 1, verbose = 0)
head(m)
```

-## Imputation takes too much time. What can I do?
+### Imputation takes too much time. What can I do?

`missRanger()` iteratively fits a random forest for each variable with missing values. Since the underlying random forest implementation `ranger()` uses 500 trees by default, a huge number of trees might be grown. For larger data sets, the overall process can take a very long time.
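
As a rough back-of-the-envelope sketch of why this adds up: assuming one forest is fitted per variable with missing values in every iteration, the total number of trees is a simple product. The two workload counts below are made-up illustrations, not measurements; only the default of 500 trees comes from the text above.

```r
# Hypothetical workload: these two counts are illustrative assumptions
n_vars_with_na <- 10   # variables that need imputation
n_iterations   <- 5    # iterations until the process stops
num_trees      <- 500  # ranger() default number of trees

# One forest per variable and iteration
total_trees <- n_vars_with_na * n_iterations * num_trees
total_trees

# The same workload with num.trees = 50 grows a tenth of the trees
total_trees_small <- n_vars_with_na * n_iterations * 50
total_trees_small
```

Under these assumptions, reducing `num.trees` or the number of iterations scales the work down proportionally.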

@@ -156,7 +164,7 @@ Here are tweaks to make things faster:

- Use a low `max.iter`, e.g. 1 or 2.

-### Examples evaluated on a normal laptop (not run here)
+Evaluated on a normal laptop:

```r
library(ggplot2) # for diamonds data
@@ -185,12 +193,10 @@ system.time(
)
```

-## Trick: Use `case.weights` to weight down contribution of rows with many missings
+### Trick: Use `case.weights` to weight down contribution of rows with many missings

Using the `case.weights` argument, you can pass case weights to the imputation models. This might be useful to weight down the contribution of rows with many missing values.

-### Example
-
``` {r}
# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))