Data Imputation with Random Forests – Effective Techniques and Strategies

data mining, data-imputation, missing data, predictive-models, random forest

I have two questions on using random forest (specifically randomForest in R) for missing value imputation (in the predictor space).

1) How does the imputation algorithm work? Specifically, how and why is the class label required for imputation? Is the proximity matrix, which supplies the weights for the averages used to fill in missing values, computed separately for each class?

2) If the class label is needed to impute missing values, how can the procedure be applied to new data whose labels you are trying to predict?

Best Answer

The basic idea is to do a quick fill of the missing values and then iteratively improve the imputation using the proximity matrix. To handle unlabeled data, replicate the data once per class label and then treat the result as labeled data.

The proximity between a pair of observations is the fraction of trees in which they share a terminal node. Because the trees are grown on the class labels, the proximities, and hence the imputation, depend explicitly on those labels.
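
As a quick illustration of where these proximities come from (using the built-in iris data purely as a stand-in):

    library(randomForest)

    # Grow a forest on the labeled data and request proximities:
    # prox[i, k] is the fraction of trees in which observations i and k
    # fall in the same terminal node.
    rf   <- randomForest(Species ~ ., data = iris, proximity = TRUE)
    prox <- rf$proximity
    dim(prox)    # 150 x 150: one row and column per observation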

Training set:

  1. Replace each missing value with the variable's average over the non-missing cases.
  2. Repeat until satisfied:

    a. Using imputed values calculated so far, train a random forest.

    b. Compute the proximity matrix.

    c. Using the proximities as weights, impute each missing value as the weighted average of the non-missing values of that variable (a minimal R sketch follows this list).
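
In practice, randomForest::rfImpute() implements this training-set loop directly (e.g. rfImpute(X, y, iter = 5)): it starts from the quick na.roughfix() fill and then updates the fills with proximity-weighted averages. For intuition, here is a minimal hand-rolled sketch, assuming a hypothetical predictor data frame X containing NAs and a factor label vector y:

    library(randomForest)

    miss  <- is.na(X)            # remember which cells were missing
    X_imp <- na.roughfix(X)      # step 1: quick median/mode fill

    for (it in 1:5) {            # the references suggest 4-6 iterations
      rf   <- randomForest(X_imp, y, ntree = 300, proximity = TRUE)  # step a
      prox <- rf$proximity                                           # step b
      for (j in seq_along(X_imp)) {                                  # step c
        obs <- which(!miss[, j])
        for (i in which(miss[, j])) {
          w <- prox[i, obs]
          if (is.numeric(X_imp[[j]])) {
            # weighted mean shown for brevity; the text uses the weighted median
            X_imp[i, j] <- sum(w * X_imp[obs, j]) / sum(w)
          } else {
            # weighted mode: the level with the largest total proximity weight
            X_imp[i, j] <- names(which.max(tapply(w, X_imp[obs, j], sum)))
          }
        }
      }
    }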

Test set:

  1. If labels exist, use the fills derived from the training data.
  2. If the data is unlabeled, replicate the test set with a copy for each class label and proceed as with labeled data (see the sketch after this list).
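
A minimal sketch of the replication step, assuming a hypothetical unlabeled test frame X_test and the training labels y:

    # One copy of the test set per class label; the stacked copies are then
    # run through the same proximity-weighted imputation as labeled data.
    classes <- levels(y)
    X_rep   <- X_test[rep(seq_len(nrow(X_test)), times = length(classes)), ]
    y_rep   <- factor(rep(classes, each = nrow(X_test)), levels = classes)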

Here, (weighted) average means the (weighted) median for numerical variables and the (weighted) mode for categorical variables. The references recommend 4-6 iterations.
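
If you want the exact weighted statistics named here rather than the weighted mean used in the sketch above, two small hypothetical helpers suffice:

    # Smallest value whose cumulative (normalized) weight reaches 1/2.
    weighted_median <- function(v, w) {
      ord <- order(v)
      cw  <- cumsum(w[ord]) / sum(w)
      v[ord][which(cw >= 0.5)[1]]
    }

    # Level with the largest total weight.
    weighted_mode <- function(v, w) {
      names(which.max(tapply(w, v, sum)))
    }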

References: randomForest R documentation (PDF), Breiman's manual v4.0 (PDF), Breiman's random forests page.
