Solved – RandomForest MeanDecreaseAccuracy interpretation

Tags: accuracy, importance, r, random forest

I know there are already some questions regarding the interpretation of the MeanDecreaseAccuracy metric of the randomForest package, but it's still unclear to me. My assumption was that each variable is replaced by a randomly permuted version and the OOB error rate is recalculated. The increase in the error rate then tells me the importance of the variable.

So let's take the variable "Start" in the example below. Can I say that, if I removed "Start" from my model, the current error rate (20.99%) would be expected to increase by 15.8 percentage points (i.e. to 20.99% + 15.8%)?

I guess not, but what would the correct formulation be?

Thanks!

library(rpart)
library(randomForest)
set.seed(123)

fit <- randomForest(Kyphosis ~ Age + Number + Start , data=kyphosis, importance=TRUE, ntree=500)
print(fit)
importance(fit)


OOB estimate of  error rate: 20.99%
Confusion matrix:
        absent present class.error
absent      59       5   0.0781250
present     12       5   0.7058824

> importance(fit)
           absent   present MeanDecreaseAccuracy MeanDecreaseGini
Age    1.47962289  5.177725             4.347754         8.617383
Number 0.06287608  3.407481             2.909213         5.474125
Start  9.45380414 14.290110            15.809382         9.977147

Best Answer

Variable importance (VI) is computed by (1) training the forest, (2) predicting the OOB samples, (3) permuting each variable in turn and re-predicting the OOB samples, and (4) subtracting the OOB performance in step 3 from the OOB performance in step 2. VI (%MSE or % classification error) will only under some conditions approximate the loss of prediction performance from omitting a given variable from training. To understand when it does and when it does not, imagine that two variables can hold redundant and/or complementary information.

If two variables are completely redundant, e.g. by having a Spearman correlation of 1 (the rank-based, non-parametric analogue of the Pearson correlation), they will be interchangeable while the forest is grown. Thus an equal number of splits will rely on each of them, and the VI of both variables will be the same. But if a new forest is grown with one of the redundant variables omitted, the prediction performance could be almost unchanged, whereas the VI of the remaining redundant variable would double (if there are no other redundancies). Prediction performance would only decline when the forest is regrown a third time with both redundant variables omitted.
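The four steps above can be sketched by hand on the kyphosis example from the question. This is a rough illustration, not the package's internal code: it assumes `keep.inbag = TRUE` to recover which rows were out-of-bag for each tree, and it permutes each variable once over the whole data set rather than separately within each tree's OOB rows. `tree_oob_acc` and `manual_vi` are names made up for this sketch. Note that `importance()` divides permutation measures by their standard errors by default (`scale = TRUE`), so `scale = FALSE` is used for the comparison.

```r
library(rpart)          # provides the kyphosis data
library(randomForest)
set.seed(123)

# Step 1: train, keeping the in-bag bookkeeping for every tree
fit <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    importance = TRUE, ntree = 500, keep.inbag = TRUE)

oob <- fit$inbag == 0                    # TRUE where a row is out-of-bag for a tree
y   <- as.character(kyphosis$Kyphosis)

# Per-tree accuracy on that tree's OOB rows, for a given data set
tree_oob_acc <- function(data) {
  votes <- predict(fit, newdata = data, predict.all = TRUE)$individual
  sapply(seq_len(ncol(votes)),
         function(t) mean(votes[oob[, t], t] == y[oob[, t]]))
}

# Step 2: OOB accuracy on the unpermuted data
acc_orig <- tree_oob_acc(kyphosis)

# Steps 3 and 4: permute one variable at a time, re-predict, take the difference
manual_vi <- sapply(c("Age", "Number", "Start"), function(v) {
  permuted <- kyphosis
  permuted[[v]] <- sample(permuted[[v]])   # simplification: one global shuffle
  mean(acc_orig - tree_oob_acc(permuted))
})

manual_vi
importance(fit, type = 1, scale = FALSE)   # unscaled MeanDecreaseAccuracy
```

The two sets of numbers should be of similar magnitude but will not match exactly, since the shuffles differ. Either way, they describe how much a forest that was trained with the variable suffers when that variable is scrambled, not how a forest trained without it would perform.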

Variables can also be complementary or even interdependent. E.g. a variable (lab-source) describing which lab produced a test result may be completely useless unless the test result itself (lab-result) is also known. If the "lab-result" variable is omitted before training, the "lab-source" variable will have a lower variable importance. When an RF model has essentially captured a strong pairwise variable interaction, VI can understate the loss of prediction performance from omitting one of the variables, because omitting it in fact renders the other variable useless as well.
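A small simulation can make the lab example concrete. Everything here is invented for illustration: the data, the variable names `lab_source` and `lab_result`, and the rule that lab "B" reports on an inverted scale, so that the source is only informative together with the result.

```r
library(randomForest)
set.seed(1)

n <- 500
lab_source <- factor(sample(c("A", "B"), n, replace = TRUE))
lab_result <- rnorm(n)

# Hypothetical rule: a positive result means "pos" at lab A but "neg" at lab B
y <- factor(ifelse(xor(lab_result > 0, lab_source == "B"), "pos", "neg"))
d <- data.frame(lab_source, lab_result, y)

# Forest with both variables, and forest with lab_result omitted before training
both        <- randomForest(y ~ lab_source + lab_result, data = d, importance = TRUE)
source_only <- randomForest(y ~ lab_source, data = d, importance = TRUE)

vi_both        <- importance(both, type = 1, scale = FALSE)
vi_source_only <- importance(source_only, type = 1, scale = FALSE)

vi_both
vi_source_only
```

With both variables present, `lab_source` would be expected to show a clearly positive importance. Once `lab_result` is left out, its importance should fall to roughly zero, because the source alone carries no information about `y` in this construction.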

Such deviations between the loss of prediction performance from omitting a variable before growing the forest and the VI obtained by permuting it afterwards can be used to screen for pairwise variable redundancy and 'reliance'.
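One way to look for such deviations on the kyphosis example is to regrow the forest with each variable left out and set the change in OOB error next to the unscaled permutation importance. This is only a sketch: `oob_error`, `drop_increase` and `perm_vi` are names introduced here, and with 81 observations the differences are noisy, so a careful comparison would average over several seeds.

```r
library(rpart)          # provides the kyphosis data
library(randomForest)

vars <- c("Age", "Number", "Start")

# Final OOB error rate of a forest grown on the given predictors
oob_error <- function(predictors, seed = 123) {
  set.seed(seed)
  f <- randomForest(x = kyphosis[, predictors, drop = FALSE],
                    y = kyphosis$Kyphosis, ntree = 500)
  unname(f$err.rate[f$ntree, "OOB"])
}

full_err <- oob_error(vars)

# Change in OOB error when a variable is omitted BEFORE training
drop_increase <- sapply(vars, function(v) oob_error(setdiff(vars, v)) - full_err)

# Permutation importance of the full forest (variable scrambled AFTER training)
set.seed(123)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    importance = TRUE, ntree = 500)
perm_vi <- importance(fit, type = 1, scale = FALSE)[vars, 1]

cbind(drop_increase, perm_vi)
```

A variable whose `perm_vi` is clearly larger than its `drop_increase` is a candidate for being partly redundant with another predictor. The opposite pattern hints that another variable relies on it.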

VI is mainly used for variable selection by ranking. The absolute value of VI is often regarded as arbitrary.