Solved – Random Forest: Class-specific Gini variable importance in R

gini, machine learning, r, random forest

library(randomForest)
data(iris)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
fit.imp <- importance(fit)   # matrix of variable-importance measures
fit.imp

Columns 1-3 show the class-specific variable importance for the Mean Decrease Accuracy measure.
Note that for Sepal.Length, the class-specific VIs are lower than the Mean VI value for Accuracy.
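
For reference, the class-specific accuracy importances sit in the columns named after the class levels, and the overall measures are the last two columns, so they can be pulled apart like this:

fit.imp[, levels(iris$Species)]        # per-class mean decrease in accuracy
fit.imp[, "MeanDecreaseAccuracy"]      # accuracy importance averaged over classes
fit.imp[, "MeanDecreaseGini"]          # Gini importance (not class-specific)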

I have two questions about the implementation of randomForest in R:

1) How are the class-specific importances calculated? In particular, how is it possible for the class-specific values to be lower than the overall mean? I understand the theory of how permutation accuracy is calculated, but I am not a mathematician, so reading the raw equations doesn't help me much. Any quick explanation would be much appreciated before I dive into the randomForest package source code.

2) Is there a way to calculate class-specific Gini metrics, not class-specific Accuracy metrics (the default)? I really want to do this. I was about to start trying to code a way to do it, but thought I would ask here first.

Best Answer

To retrieve a class-specific Gini metric, you first need a clear definition. Should it be computed on the in-bag samples, as the ordinary Gini importance is, or summed over the OOB samples, as the permutation VI is? Either way, you would need to modify the source code of the package. In this thread it is explained where and how the Gini loss function is computed.
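
For example, one possible definition (my own sketch, not something randomForest computes internally) is to split the node Gini impurity, sum_k p_k * (1 - p_k), into its per-class terms and take the size-weighted decrease of each term at a split:

# per-class terms of the Gini impurity of a node, given its class counts
class_gini <- function(counts) {
  p <- counts / sum(counts)
  p * (1 - p)            # sum(p * (1 - p)) is the usual total Gini impurity
}

# per-class Gini decrease for one split (parent node into left/right children)
gini_decrease_by_class <- function(parent, left, right) {
  n <- sum(parent); nl <- sum(left); nr <- sum(right)
  class_gini(parent) - (nl/n) * class_gini(left) - (nr/n) * class_gini(right)
}

# toy example with three classes
gini_decrease_by_class(parent = c(50, 50, 50), left = c(50, 0, 0), right = c(0, 50, 50))

Summing such per-class decreases over all splits on a variable, and then over all trees, would give one candidate class-specific Gini importance; whether that is the definition you actually want is exactly the question.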

I'm not sure you actually want this class-specific Gini metric for comparing the forest with a single tree. Gini impurity decrease is already quite unstable across the entire forest; computed for a single tree and split across many classes, it would be even more unstable.

No model is a complete black box: check out partial dependence plots from the randomForest package itself, or the extended ICEbox package for interactions (article).
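
For the class-specific angle, randomForest's partialPlot() already lets you pick a class via which.class (for classification, the y-axis is on a centred log-odds scale). Using the fit object from the question:

partialPlot(fit, pred.data = iris, x.var = "Petal.Width",
            which.class = "versicolor",
            main = "partial dependence for class versicolor")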

Concerning speed: if you train 50 trees with a reduced bootstrap sampsize, prediction usually takes only a few milliseconds. Do you really need to go faster? If so, consider calling the native C prediction function directly from a non-R environment.
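
As a rough illustration of that small-and-fast setup (timings will of course vary by machine):

library(randomForest)
data(iris)
small_rf <- randomForest(Species ~ ., data = iris,
                         ntree = 50, sampsize = 30)  # 50 trees, reduced bootstrap size
system.time(predict(small_rf, iris))                 # typically a few milliseconds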

Also, here's a small example from my own package forestFloor of how to interpret an RF model trained on the iris data. The iris data set is very simple to visualize, because the features do not interact much.

library(forestFloor)
library(randomForest)
set.seed(1)
data(iris)
X = iris[, names(iris) != "Species"]
y = iris[, "Species"]
rf = randomForest(X, y,
                  keep.inbag = TRUE,  # mandatory for forestFloor
                  replace = FALSE,    # if TRUE, use trimTrees::cinbag instead
                  importance = TRUE)  # recommended
ff = forestFloor(rf, X)
plot(ff, colLists = list('#DD101060', '#10DD1060', '#1010DD60'), plot_GOF = TRUE)
# colours are the individual classes
# y-axis is the additive change of predicted probability as a function of each feature
# x-axis is the individual feature values
# lines quantify goodness-of-fit, i.e. how well the model can be explained by these 2D visualizations

[Figure: forestFloor feature-contribution plots for the iris random forest]