Solved – How to impute a missing categorical predictor variable for a random forest model

missing-data, r, random-forest

I have a set of x, y data I'm using to build a random forest. The x data is a vector of values that includes some NAs. So I use rfImpute to handle the missing data and grow a random forest. Now I have a new, unseen observation x (with an NA) and I want to predict y. How do I impute the missing value so that I can use the random forest I have already grown? The rfImpute function seems to require both x and y, but at prediction time I only have x.
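To make the setup concrete, here is a minimal sketch of the situation, assuming x is a data frame of predictors containing NAs and y is the known training response; the object names and the particular cell set to NA are placeholders, not code from the question.

library(randomForest)

# training: rfImpute uses both x and y to fill in the NAs,
# then a forest is grown on the completed predictors
imputed = rfImpute(x, y)              # first column of the result is the response
rf = randomForest(imputed[, -1], y)

# prediction: a new observation with a missing predictor and no known y,
# so rfImpute cannot be applied here -- this is the problem
x.new = x[1, ]
x.new[1, 2] = NA
# predict(rf, x.new)                  # errors until the NA is somehow imputed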

My question is similar (but different) to this question. As an example, I can use the same iris dataset. If I've correctly interpreted the code in the answer to the question I reference, the code iris.na[148, , drop=FALSE] in the statement iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE]) represents the new data, which includes the Species (the Y value). In my problem I would not know the Species; that is what I want the random forest to predict. I would have the 4 independent variables, but some might be NA for a given row. To continue the analogy, imagine I have 3 of the 4 variables and one is missing. I want to impute that value, and then predict the species, which I do not know.

In response to gung's comment that I should add an illustration, let me put it in terms of the iris data set. Imagine I have the following data on a flower: I know its Sepal.Length, Sepal.Width, and Petal.Length, but not its Petal.Width. I'd like to impute the Petal.Width and then use those 4 values within an RF model to predict the Species.
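For example, such an observation might look like this (the numeric values are made up purely for illustration; they are not from the original question):

# hypothetical new flower: three measurements known, Petal.Width missing
new.flower = data.frame(Sepal.Length = 5.8,
                        Sepal.Width  = 2.7,
                        Petal.Length = 4.1,
                        Petal.Width  = NA)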

Best Answer

I think you need an unsupervised imputation method, that is, one which does not use the target values for imputation. If you only have a few prediction feature vectors, it may be difficult to uncover the data structure from them alone. Instead, you can mix the new feature vectors with the already-imputed training feature vectors and use that structure to impute once again. Note that this procedure may violate independence assumptions, so wrap the entire procedure in an outer cross-validation to check for serious overfitting.

I just learned about missForest from a comment on this question, and it seems to do the trick. I simulated your problem on the iris data (without the outer cross-validation).

rm(list=ls())
data("iris")
set.seed(1234)
n.train = 100
train.index = sample(nrow(iris),n.train)
feature.train = as.matrix(iris[ train.index,1:4])
feature.test  = as.matrix(iris[-train.index,1:4])


#simulate 40 NAs in train
n.NAs = 40
NA.index = sample(length(feature.train),n.NAs)
NA.feature.train = feature.train
NA.feature.train[NA.index] = NA

#imputing 40 NAs unsupervised
library(missForest)
imp.feature.train = missForest(NA.feature.train)$ximp
#check how well the imputation went; looks promising for this data set
plot(feature.train[NA.index], imp.feature.train[NA.index],
     xlab = "true value", ylab = "imputed value")

#simulate random NAs in feature test
feature.test[sample(length(feature.test),20)] = NA

#mix feature.test with imp.feature.train
nrow.test = nrow(feature.test)
mix.feature = rbind(feature.test,imp.feature.train)
imp.feature.test = missForest(mix.feature)$ximp[1:nrow.test,]

#train RF and predict
library(randomForest)
rf = randomForest(imp.feature.train,iris$Species[train.index])
pred.test = predict(rf,imp.feature.test)
table(pred.test, iris$Species[-train.index])

The resulting confusion matrix for the 50 test observations (predicted classes in rows, true classes in columns):
pred.test    setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         20         2
  virginica       0          1        15
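Only 3 of the 50 held-out flowers are misclassified, so the unsupervised imputation does not hurt much on this data. The outer cross-validation mentioned above is not part of this example; a rough sketch of how the whole impute-mix-predict pipeline could be wrapped in one follows, where the fold count, the number of simulated NAs, and the loop structure are my own choices rather than code from the answer.

library(missForest)
library(randomForest)

# introduce some random NAs in the predictors, as in the simulation above
set.seed(1234)
iris.na = as.matrix(iris[, 1:4])
iris.na[sample(length(iris.na), 60)] = NA

# 5-fold outer cross-validation around the impute-then-predict pipeline
folds = sample(rep(1:5, length.out = nrow(iris)))
acc = numeric(5)
for (k in 1:5) {
  train = iris.na[folds != k, ]
  test  = iris.na[folds == k, ]

  # impute training predictors without using the response
  train.imp = missForest(train)$ximp

  # impute test predictors by mixing them with the already-imputed training rows
  mix = rbind(test, train.imp)
  test.imp = missForest(mix)$ximp[seq_len(nrow(test)), , drop = FALSE]

  # fit on imputed training features, evaluate on imputed test features
  rf = randomForest(train.imp, iris$Species[folds != k])
  acc[k] = mean(predict(rf, test.imp) == iris$Species[folds == k])
}
mean(acc)   # cross-validated accuracy of the full pipeline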