Solved – Random forest: which features led to a certain prediction

Tags: random forest, scikit-learn

I have trained a random forest classifier using the sklearn Python package and used it to classify a data point with a certain feature vector.

Let's assume that the random forest has only one tree, that this is a binary classification task, and the data point has been labeled as class '0', while I was expecting it to be '1'. How can I check which features were responsible for such classification? Is there a way to get the list of split-thresholds for each feature?

How can this be generalised to the multiclass case, with multiple trees?

Best Answer

In the canonical implementation of random forest (R's randomForest package), there is a way to produce a local (casewise) importance matrix that tells you which features contributed to the model's prediction for each individual observation.

library(randomForest)
set.seed(71)
# localImp=TRUE stores a per-observation (casewise) importance matrix
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        localImp=TRUE,
                        proximity=TRUE)

locImp <- iris.rf$localImportance
dim(locImp)
## [1]   4 150

The rows of locImp are the features and the columns are the observations, so locImp[,1] gives the local importances for the first observation:

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  0.02564103   0.01025641   0.32307692   0.37435897  

That says Petal.Width has the most weight in predicting setosa on the first observation.
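Since the question was asked about sklearn, here is a rough Python sketch of both parts: listing the split thresholds a sample passes through in one tree (via `decision_path` and the fitted `tree_` attributes), and the multiclass, multi-tree generalisation, which attributes the change in class probability at each split to the feature split on and averages over trees. The latter is the same decomposition the third-party treeinterpreter package implements; the iris forest below is just a stand-in for your own model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in model: any fitted RandomForestClassifier works the same way.
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
x = X[0].reshape(1, -1)  # the data point whose prediction we want to explain

# 1) Split thresholds along one tree's decision path for this sample.
est = rf.estimators_[0]
t = est.tree_
for node in est.decision_path(x).indices:
    if t.children_left[node] != t.children_right[node]:  # internal node
        f, thr = t.feature[node], t.threshold[node]
        side = "<=" if x[0, f] <= thr else ">"
        print(f"node {node}: feature {f} {side} {thr:.3f}")

# 2) Multiclass / multi-tree: credit each split's shift in class
#    probabilities to the feature that was split on.
def contributions(est, xvec):
    t = est.tree_
    path = est.decision_path(xvec.reshape(1, -1)).indices
    # class probabilities at every node (normalise the stored class counts)
    probs = t.value[:, 0, :] / t.value[:, 0, :].sum(axis=1, keepdims=True)
    contrib = np.zeros((xvec.size, probs.shape[1]))
    for parent, child in zip(path[:-1], path[1:]):
        contrib[t.feature[parent]] += probs[child] - probs[parent]
    return probs[path[0]], contrib  # (root prior, per-feature per-class deltas)

bias, contrib = zip(*(contributions(e, x[0]) for e in rf.estimators_))
bias, contrib = np.mean(bias, axis=0), np.mean(contrib, axis=0)

# Sanity check: prior + summed contributions reproduce predict_proba.
print(np.allclose(bias + contrib.sum(axis=0), rf.predict_proba(x)[0]))
```

Row i of `contrib` is the contribution of feature i to each class's predicted probability for this one sample, so `contrib[:, k].argmax()` names the feature that pushed hardest toward class k, directly answering "which features led to this prediction" for the multiclass, many-tree case.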