Solved – Random forest: which features led to a certain prediction

Tags: random forest, scikit-learn

I have trained a random forest classifier using the sklearn Python package and used it to classify a data point with a certain feature vector.

Let's assume that the random forest has only one tree, that this is a binary classification task, and the data point has been labeled as class '0', while I was expecting it to be '1'. How can I check which features were responsible for such classification? Is there a way to get the list of split-thresholds for each feature?

How can this be generalised to the multiclass case, with multiple trees?

Best Answer

In the canonical implementation of random forest (R's randomForest package), there is a way to produce a local (casewise) importance matrix that tells you which features contributed to the model's prediction for each individual observation.

library(randomForest)
set.seed(71)
# localImp=TRUE stores a per-observation (casewise) importance matrix
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        localImp=TRUE,
                        proximity=TRUE)

locImp <- iris.rf$localImportance
dim(locImp)
## [1]   4 150

The rows of locImp are the features and the columns are the observations, so locImp[,1] gives the local importances for the first observation:

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  0.02564103   0.01025641   0.32307692   0.37435897  

That says Petal.Width has the most weight in predicting setosa on the first observation.
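Since the question was asked about sklearn, here is a rough Python sketch of both parts: listing the split thresholds a sample passes through in one tree (via `decision_path` and the fitted `tree_` attributes), and the multiclass, multi-tree generalisation, which attributes the change in class probability at each split to the feature split on and averages over trees. The latter is the same decomposition the third-party treeinterpreter package implements; the iris forest below is just a stand-in for your own model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in model: any fitted RandomForestClassifier works the same way.
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
x = X[0].reshape(1, -1)  # the data point whose prediction we want to explain

# 1) Split thresholds along one tree's decision path for this sample.
est = rf.estimators_[0]
t = est.tree_
for node in est.decision_path(x).indices:
    if t.children_left[node] != t.children_right[node]:  # internal node
        f, thr = t.feature[node], t.threshold[node]
        side = "<=" if x[0, f] <= thr else ">"
        print(f"node {node}: feature {f} {side} {thr:.3f}")

# 2) Multiclass / multi-tree: credit each split's shift in class
#    probabilities to the feature that was split on.
def contributions(est, xvec):
    t = est.tree_
    path = est.decision_path(xvec.reshape(1, -1)).indices
    # class probabilities at every node (normalise the stored class counts)
    probs = t.value[:, 0, :] / t.value[:, 0, :].sum(axis=1, keepdims=True)
    contrib = np.zeros((xvec.size, probs.shape[1]))
    for parent, child in zip(path[:-1], path[1:]):
        contrib[t.feature[parent]] += probs[child] - probs[parent]
    return probs[path[0]], contrib  # (root prior, per-feature per-class deltas)

bias, contrib = zip(*(contributions(e, x[0]) for e in rf.estimators_))
bias, contrib = np.mean(bias, axis=0), np.mean(contrib, axis=0)

# Sanity check: prior + summed contributions reproduce predict_proba.
print(np.allclose(bias + contrib.sum(axis=0), rf.predict_proba(x)[0]))
```

Row i of `contrib` is the contribution of feature i to each class's predicted probability for this one sample, so `contrib[:, k].argmax()` names the feature that pushed hardest toward class k, directly answering "which features led to this prediction" for the multiclass, many-tree case.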