normal behaviour of missRanger compared with randomForestSRC #1

Closed
thierrygosselin opened this issue Dec 19, 2016 · 5 comments

@thierrygosselin
Contributor

Hi Michael,

I gave missRanger a try, using a genomic dataset with lots of missing genotypes (RADseq).
Could you tell me why randomForestSRC is able to impute the data below, but missRanger is not?

INDIVIDUALS GENOTYPE
1 001001
2 003003
3 001001
4 003003
5 003003
6 003001
7 003003
8 NA
9 001001
10 001001

I know imputing this would be unreliable, but apart from that, what's the solution if an analysis requires a complete dataset?

Best regards
Thierry

@mayer79
Owner

mayer79 commented Dec 19, 2016

Hi Thierry

Thx for testing missRanger.

If "GENOTYPE" is an R-factor, missRanger should be able to provide some results:


mydata <- data.frame(
  x = 1:10,
  y = c(
    "001001",
    "003003",
    "001001",
    "003003",
    "003003",
    "003001",
    "003003",
    NA,
    "001001",
    "001001"
  ),
  stringsAsFactors = TRUE  # make y a factor so it can be imputed
)

library(missRanger)
missRanger(mydata, seed = 100001)

would e.g. provide the following output on a Windows 10 PC with R version 3.3.2 and ranger version 0.6.0:

Missing value imputation by chained random forests
  missRanger iteration 1:.done
  missRanger iteration 2:.done
    x      y
1   1 001001
2   2 003003
3   3 001001
4   4 003003
5   5 003003
6   6 003001
7   7 003003
8   8 001001
9   9 001001
10 10 001001

Are you able to reproduce this example on your system?
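
If the example does not reproduce, one thing worth checking is whether the genotype column really is a factor. The following is a minimal sketch in base R, assuming the data were read in as character strings; the data frame name `geno` is made up for illustration and is not taken from the data in this thread.

```r
# Hypothetical data frame with the genotype column read in as character
geno <- data.frame(
  INDIVIDUALS = 1:10,
  GENOTYPE = c("001001", "003003", "001001", "003003", "003003",
               "003001", "003003", NA, "001001", "001001"),
  stringsAsFactors = FALSE
)

is.factor(geno$GENOTYPE)        # FALSE: still a character column

# Convert to a factor; NA stays a missing value and does not become a level
geno$GENOTYPE <- factor(geno$GENOTYPE)

is.factor(geno$GENOTYPE)        # TRUE
levels(geno$GENOTYPE)           # "001001" "003001" "003003"
sum(is.na(geno$GENOTYPE))       # 1

# R version string, useful when comparing results across systems
R.version.string
```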

@thierrygosselin
Contributor Author

Arrr... sorry about that, I took the wrong example from my data. Here is the data that was not imputed with missRanger:

INDIVIDUALS GENOTYPE
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 002002
10 004004

I know it's a crazy example, but this is from empirical data, and randomForestSRC is able to impute this.

What's the best alternative: raise a flag for this marker and say there is not enough data?

Cheers
Thierry

@mayer79
Owner

mayer79 commented Dec 20, 2016

It is indeed a crazy example, but nevertheless I have fixed this unintended behaviour, which occurred when the algorithm had already converged after the first iteration. Thanks for pointing this out.

From a statistical perspective, it is (usually) best to

  • drop columns with almost no information or

  • at least do multiple imputations, repeat the statistical analysis for each version of the data and then either combine their results or at least compare them to see how much they depend on the imputation.
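
Both options can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data frame `markers`, the 50% missingness cutoff `max_missing`, and the choice of five seeds are all made up, and varying `seed` is just one simple way to obtain several different imputations.

```r
library(missRanger)

# Hypothetical toy data: one informative marker (m1), one nearly empty one (m2)
markers <- data.frame(
  x  = 1:10,
  m1 = c("001001", "003003", "001001", "003003", "003003",
         "003001", "003003", NA, "001001", "001001"),
  m2 = c(rep(NA, 8), "002002", "004004"),
  stringsAsFactors = TRUE
)

# Option 1: flag and drop columns with almost no information
max_missing  <- 0.5                       # arbitrary cutoff (assumption)
prop_missing <- colMeans(is.na(markers))  # share of missing values per column
flagged <- names(prop_missing)[prop_missing > max_missing]
kept    <- markers[, prop_missing <= max_missing, drop = FALSE]

# Option 2: impute several times, once per seed
seeds   <- 1:5                            # number of imputations is arbitrary
imputed <- lapply(seeds, function(s) missRanger(kept, seed = s))

# Compare what was filled in for individual 8 across the imputed versions
filled <- vapply(imputed, function(d) as.character(d$m1[8]), character(1))
table(filled)
```

Here `flagged` holds the markers reported as having too little data, and the table shows how much the imputed value depends on the particular run. In a real analysis, the statistic of interest would be computed on each element of `imputed` and then combined or compared.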

@thierrygosselin
Contributor Author

Thanks Michael.
Try RADseq data if you need crazy big data imputation problems!

@mayer79
Owner

mayer79 commented Dec 20, 2016

Ha, I will check this out. I have made a clean new version 0.1.2, but technically it is the same as the bug-fixed 0.1.1.
