The randomForest package in R has the importance() function to get both node impurity and mean permutation importance for variables. Why, when calculating mean permutation importance, do the results not sum to 100%?
Here's a simple reproducible example:
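library(randomForest)
data(iris)
# The argument is `ntree`, not `ntrees`; a misspelled name is silently ignored
iris.rf <- randomForest(Species ~ ., importance = TRUE, data = iris, ntree = 1000)
# type = 1 requests the permutation measure (mean decrease in accuracy)
imp <- importance(iris.rf, type = 1)
sum_imp <- sum(imp)
sum_imp # != 100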
Thanks
Best Answer
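As far as I can tell, variable importance is measuring either: a) the increase in out-of-bag prediction error when the values of the variable are randomly permuted (type = 1, by default normalized by the standard deviation of the per-tree differences), or b) the total decrease in node impurity from splits on the variable (type = 2, the Gini index for classification). Both are averaged over all trees in the forest. Note that the variable is permuted, not removed, and the model is not refit. Neither of these is a probability or a share of some fixed total, so there's no reason they should add up to 100%.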
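To see the two measures side by side, here is a minimal sketch using the same iris model as in the question. The seed value and the object names acc_scaled, acc_raw and gini are my own choices for illustration, not anything the package requires.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so repeated runs give the same forest
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 1000)

# type = 1: permutation measure (mean decrease in accuracy).
# scale = TRUE is the default and divides each value by its standard error;
# scale = FALSE returns the raw mean decrease in accuracy.
acc_scaled <- importance(iris.rf, type = 1)
acc_raw    <- importance(iris.rf, type = 1, scale = FALSE)

# type = 2: mean decrease in node impurity (Gini index for classification).
gini <- importance(iris.rf, type = 2)

# One row per predictor, one column per measure
cbind(acc_scaled, acc_raw, gini)
```

The three columns are on different scales, and none of them is constructed as a fraction of a whole, which is why none of the column sums comes out at 100.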
You can, of course, divide by the sum of all importances to get a percentage, but I think that would create confusion: you now have a percentage of what exactly?
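If you do want relative shares for display, the rescaling is a one-liner. This is only a sketch: imp_share is a name I made up, and the result is a presentation convenience rather than a quantity with its own interpretation.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so repeated runs give the same forest
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 1000)

imp <- importance(iris.rf, type = 1)

# Rescale so the values sum to 100. This assumes all importances are
# non-negative; permutation importance can be negative for noise variables,
# in which case the "shares" stop being meaningful.
imp_share <- 100 * imp / sum(imp)

imp_share
sum(imp_share)  # 100 by construction, up to floating-point rounding
```

The ordering of the variables is unchanged by this rescaling, so it adds nothing for ranking predictors.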
(Welcome to the site!)