normal behaviour of `missRanger` compared with `randomForestSRC` #1

thierrygosselin · 2016-12-19T01:26:01Z

Hi Michael,

I gave missRanger a try, using a genomic dataset with lots of missing genotypes (RADseq).
Could you tell me why randomForestSRC is able to impute the data below, but missRanger is not?

INDIVIDUALS	GENOTYPE
1	001001
2	003003
3	001001
4	003003
5	003003
6	003001
7	003003
8	NA
9	001001
10	001001

I know imputing this would be unreliable, but apart from that, what's the solution if a complete dataset is required for an analysis?

Best regards
Thierry

mayer79 · 2016-12-19T10:52:01Z

Hi Thierry

Thx for testing missRanger.

If "GENOTYPE" is an R-factor, missRanger should be able to provide some results:


mydata <- data.frame(
  x = 1:10,
  y = c(
    "001001",
    "003003",
    "001001",
    "003003",
    "003003",
    "003001",
    "003003",
    NA,
    "001001",
    "001001"
  ),
  stringsAsFactors = TRUE
)

library(missRanger)
missRanger(mydata, seed = 100001)

would e.g. provide the following output on a Windows 10 PC with R version 3.3.2 and ranger version 0.6.0:

Missing value imputation by chained random forests
  missRanger iteration 1:.done
  missRanger iteration 2:.done
    x      y
1   1 001001
2   2 003003
3   3 001001
4   4 003003
5   5 003003
6   6 003001
7   7 003003
8   8 001001
9   9 001001
10 10 001001

Are you able to reproduce this example on your system?
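
If the genotypes were read in as character strings rather than as a factor, a quick check and conversion along these lines may help before calling missRanger. This is only a minimal base R sketch: the data frame `geno` is made up for illustration and simply mirrors the table in the question.

```r
# Hypothetical data frame mirroring the table in the question;
# stringsAsFactors = FALSE mimics data read in as plain character strings.
geno <- data.frame(
  INDIVIDUALS = 1:10,
  GENOTYPE = c("001001", "003003", "001001", "003003", "003003",
               "003001", "003003", NA, "001001", "001001"),
  stringsAsFactors = FALSE
)

# Check the column type before imputing
print(is.factor(geno$GENOTYPE))

# Convert to a factor; NA stays a missing value and does not become a level
geno$GENOTYPE <- factor(geno$GENOTYPE)

print(levels(geno$GENOTYPE))
print(sum(is.na(geno$GENOTYPE)))
```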

thierrygosselin · 2016-12-19T16:06:30Z

Arrr... sorry about that, I took the wrong example from my data. Here is the data that was not imputed with missRanger:

INDIVIDUALS	GENOTYPE
1	NA
2	NA
3	NA
4	NA
5	NA
6	NA
7	NA
8	NA
9	002002
10	004004

I know it's a crazy example, but this is from empirical data, and randomForestSRC is able to impute this.

What's the best alternative: raise a flag for this marker and say there is not enough data?

Cheers
Thierry

mayer79 · 2016-12-20T12:48:45Z

It is indeed a crazy example, but nevertheless I have fixed this unintended behaviour, which occurred when the algorithm converged after the first iteration. Thanks for pointing this out.

From a statistical perspective, it is (usually) best to

- drop columns with almost no information, or
- at least do multiple imputations, repeat the statistical analysis for each version of the data, and then either combine the results or at least compare them to see how much they depend on the imputation.
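
The two suggestions above could be sketched roughly as follows. The helper `flag_sparse_columns()` and its 0.5 threshold are assumptions made up for this illustration and are not part of missRanger. The multiple-imputation step relies only on the `seed` argument already used earlier in this thread, and it is skipped if missRanger is not installed.

```r
# Hypothetical helper (not part of missRanger): returns the names of
# columns whose share of missing values exceeds max_na.
flag_sparse_columns <- function(data, max_na = 0.5) {
  na_share <- colMeans(is.na(data))
  names(na_share)[na_share > max_na]
}

# Data from the second example in this thread: 8 of 10 genotypes missing
mydata <- data.frame(
  x = 1:10,
  y = c(rep(NA, 8), "002002", "004004"),
  stringsAsFactors = TRUE
)

sparse <- flag_sparse_columns(mydata)
print(sparse)

# Option 1: drop columns with almost no information
reduced <- mydata[, setdiff(names(mydata), sparse), drop = FALSE]

# Option 2: several imputations with different seeds, then compare them
if (requireNamespace("missRanger", quietly = TRUE)) {
  seeds <- c(1, 2, 3, 4, 5)
  imputations <- lapply(
    seeds,
    function(s) missRanger::missRanger(mydata, seed = s)
  )
  # How often each genotype was imputed for individual 8 across the runs
  print(table(vapply(imputations, function(d) as.character(d$y[8]), character(1))))
}
```

If the imputed value for a given individual changes from seed to seed, that is a sign the downstream analysis depends heavily on the imputation, and flagging or dropping the marker is probably the safer choice.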

thierrygosselin · 2016-12-20T15:34:11Z

Thanks, Michael.
Look at RADseq data if you need crazy big data imputation problems!

mayer79 · 2016-12-20T15:40:54Z

Ha, I will check this out. I have made a clean new version 0.1.2, but technically it is the same as the bug-fixed 0.1.1.

thierrygosselin closed this as completed Dec 20, 2016
