Data Imputation with Random Forests – Effective Techniques and Strategies

data mining, data-imputation, missing data, predictive-models, random forest

I have two questions on using random forest (specifically randomForest in R) for missing value imputation (in the predictor space).

1) How does the imputation algorithm work? Specifically, how and why is the class label required for imputation? Is the proximity matrix, which supplies the weights for the averages used to fill in missing values, computed separately for each class?

2) If the class label is needed to impute missing values, how can the procedure be applied to new data whose labels you are trying to predict?

Best Answer

The basic idea is to do a quick fill of the missing values and then iteratively improve the imputation using the proximity matrix. To handle unlabeled data, replicate the data once per class label and then treat the result as labeled data.

The proximity between a pair of observations is the fraction of trees in which they share a terminal node. Because the trees are grown on the class labels, the proximities, and hence the imputation, depend explicitly on those labels.
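
As a quick illustration of where these proximities come from (using the built-in iris data purely as a stand-in):

    library(randomForest)

    # Grow a forest on the labeled data and request proximities:
    # prox[i, k] is the fraction of trees in which observations i and k
    # fall in the same terminal node.
    rf   <- randomForest(Species ~ ., data = iris, proximity = TRUE)
    prox <- rf$proximity
    dim(prox)    # 150 x 150: one row and column per observation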

Training set:

  1. Replace each missing value with the variable's average over the non-missing cases.
  2. Repeat until satisfied:

    a. Using imputed values calculated so far, train a random forest.

    b. Compute the proximity matrix.

    c. Using the proximities as weights, impute each missing value as the weighted average of the non-missing values of that variable (a minimal R sketch follows this list).
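
In practice, randomForest::rfImpute() implements this training-set loop directly (e.g. rfImpute(X, y, iter = 5)): it starts from the quick na.roughfix() fill and then updates the fills with proximity-weighted averages. For intuition, here is a minimal hand-rolled sketch, assuming a hypothetical predictor data frame X containing NAs and a factor label vector y:

    library(randomForest)

    miss  <- is.na(X)            # remember which cells were missing
    X_imp <- na.roughfix(X)      # step 1: quick median/mode fill

    for (it in 1:5) {            # the references suggest 4-6 iterations
      rf   <- randomForest(X_imp, y, ntree = 300, proximity = TRUE)  # step a
      prox <- rf$proximity                                           # step b
      for (j in seq_along(X_imp)) {                                  # step c
        obs <- which(!miss[, j])
        for (i in which(miss[, j])) {
          w <- prox[i, obs]
          if (is.numeric(X_imp[[j]])) {
            # weighted mean shown for brevity; the text uses the weighted median
            X_imp[i, j] <- sum(w * X_imp[obs, j]) / sum(w)
          } else {
            # weighted mode: the level with the largest total proximity weight
            X_imp[i, j] <- names(which.max(tapply(w, X_imp[obs, j], sum)))
          }
        }
      }
    }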

Test set:

  1. If labels exist, use the fills derived from the training data.
  2. If the data is unlabeled, replicate the test set with a copy for each class label and proceed as with labeled data (see the sketch after this list).
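
A minimal sketch of the replication step, assuming a hypothetical unlabeled test frame X_test and the training labels y:

    # One copy of the test set per class label; the stacked copies are then
    # run through the same proximity-weighted imputation as labeled data.
    classes <- levels(y)
    X_rep   <- X_test[rep(seq_len(nrow(X_test)), times = length(classes)), ]
    y_rep   <- factor(rep(classes, each = nrow(X_test)), levels = classes)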

Here, (weighted) average means the (weighted) median for numerical variables and the (weighted) mode for categorical variables. The references recommend 4-6 iterations.
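
If you want the exact weighted statistics named here rather than the weighted mean used in the sketch above, two small hypothetical helpers suffice:

    # Smallest value whose cumulative (normalized) weight reaches 1/2.
    weighted_median <- function(v, w) {
      ord <- order(v)
      cw  <- cumsum(w[ord]) / sum(w)
      v[ord][which(cw >= 0.5)[1]]
    }

    # Level with the largest total weight.
    weighted_mode <- function(v, w) {
      names(which.max(tapply(w, v, sum)))
    }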

References: randomForest R documentation (PDF), Breiman's manual v4.0 (PDF), Breiman's random forests page.
