library(randomForest)
data(iris)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
fit.imp <- importance(fit)
fit.imp
Columns 1-3 show the class-specific variable importance for the Mean Decrease Accuracy measure.
Note that for Sepal.Length, the class-specific VIs are lower than the overall Mean Decrease Accuracy VI.
I have two questions about the implementation of RandomForest in R:
1) How are the class-specific importances calculated? That is, how can the class-specific means be lower than the overall mean? I understand the theory of how permutation accuracy is calculated, but I am not a mathematician, so reading the raw equations doesn't help me much. A quick explanation would be much appreciated before I dive into the RF package source code.
2) Is there a way to calculate class-specific Gini metrics rather than the class-specific accuracy metrics (the default)? I really want to do this, and was about to start trying to code it myself, but thought I would ask here first.
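For reference, here is what I have been looking at so far. As far as I can tell from `?importance`, the scaling behaviour may be relevant to question 1: by default each importance column is divided by its own standard error, which is a sketch of why the overall column need not equal a simple mean of the class columns (the `scale` argument below is documented in the package; my interpretation of it is an assumption).

```r
library(randomForest)
data(iris)
set.seed(42)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Default: each column (per-class and overall accuracy) is divided
# by its own standard error (scale = TRUE), so the overall column
# need not be a simple mean of the class-specific columns.
importance(fit)

# Raw, unscaled mean decreases in accuracy for comparison.
importance(fit, scale = FALSE)
```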
Best Answer
To retrieve a class-specific Gini metric, you first need a clear definition: should it be computed on the in-bag samples (as the Gini decrease normally is), or summed over the OOB samples as the permutation VI is? Either way, you would need to modify the source code of the package. This thread explains where and how the Gini loss function is computed.
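For comparison, the aggregate (non-class-specific) Gini importance is already available without touching the source, and the raw split structure of each tree can be inspected with getTree(); a class-specific variant would have to be recomputed from that structure plus the training data (a sketch, not a full implementation):

```r
library(randomForest)
data(iris)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

# Built-in aggregate Gini importance (type = 2), summed over all
# splits in all trees -- no class-specific breakdown is stored.
importance(fit, type = 2)

# Split structure of tree 1; a class-specific Gini metric would
# have to be recomputed from this and the training data.
head(getTree(fit, k = 1, labelVar = TRUE))
```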
I'm not sure you actually want this class-specific Gini metric to compare a forest with a single tree. Gini impurity is already quite unstable across the entire forest; computed for a single tree and split over many classes, it would be even more unstable.
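To see why, the Gini impurity of a node is just 1 minus the sum of squared class proportions, so with few samples per node the estimate moves a lot when a single observation changes (a minimal illustration, not part of the package):

```r
# Gini impurity of a vector of class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 50, 50))  # balanced 3-class node -> 0.6666667 (= 2/3)

# Small node: moving one observation between classes shifts the
# estimate noticeably -- the instability referred to above.
gini(c(5, 3, 2))
gini(c(5, 4, 1))
```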
No model is fully a black box: check out partial dependence plots from the randomForest package itself, or the extended version ICEbox for interactions (article).
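For example, the built-in partialPlot() shows the marginal effect of one feature on the prediction for a chosen class (the feature and class below are just an illustration):

```r
library(randomForest)
data(iris)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

# Partial dependence of the prediction on Petal.Width for the
# class "versicolor", marginalizing over the other features.
partialPlot(fit, pred.data = iris, x.var = "Petal.Width",
            which.class = "versicolor")
```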
Concerning speed: if you train 50 trees with a reduced bootstrap sampsize, prediction usually takes only a few milliseconds. Do you really need to go faster? If so, consider calling the native C prediction function directly from a non-R environment.
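A quick way to check whether speed is even an issue (the ntree and sampsize values below follow the suggestion above; adjust for your data):

```r
library(randomForest)
data(iris)
set.seed(1)

# Small forest with a reduced bootstrap sample size per tree.
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 50, sampsize = 30)

# Time the prediction for the whole data set -- typically
# on the order of milliseconds.
system.time(pred <- predict(fit, iris))
</code>
```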
Also, here's a small example from my own package forestFloor of how to interpret an RF model trained on the iris data. The iris data set is very simple to visualize because the features do not interact much.
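A minimal sketch of that workflow (assuming the forestFloor package is installed; as I understand its interface, the forest must be trained with keep.inbag = TRUE so the feature contributions can be recomputed):

```r
library(randomForest)
library(forestFloor)
data(iris)
set.seed(1)

X <- iris[, -5]
rf <- randomForest(X, iris$Species, keep.inbag = TRUE)

# Decompose the predictions into per-feature contributions and
# plot one panel per feature.
ff <- forestFloor(rf, X)
plot(ff)
```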