Solved – Variable importance randomForest negative values

feature selection, importance, machine learning, random forest

In a regression context, is it a good idea to remove variables that have a negative variable importance value ("%IncMSE")? Would doing so give me better predictions? What do you think?

Best Answer

The permutation-based variable importance ("%IncMSE") in a random forest is calculated roughly as follows:

  1. First, the MSE of the model is computed with the original variables. (In the randomForest package this is done per tree, on the out-of-bag data.)
  2. Then the values of a single column are randomly permuted and the MSE is computed again. For example, if a column (Col1) holds the values 1, 2, 3, 4, a random permutation might give 4, 3, 1, 2. Call the resulting error MSE1. The increase in error, MSE1 - MSE, measures the importance of that variable.
  3. We expect this difference to be positive. A negative value means the model happened to predict better with the permuted column than with the original one. Since shuffling a column cannot add information, this indicates that the variable plays no real role in the prediction, i.e., it is not important.
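
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the randomForest package's implementation: the data are synthetic, `predict` is a hypothetical stand-in for an already fitted model, and the out-of-bag averaging and scaling that the package applies are left out.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, col, rng):
    """Return (MSE after permuting column `col`) - (baseline MSE)."""
    baseline = mse(y, [predict(row) for row in X])          # step 1
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)                                   # step 2: permute one column
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    permuted = mse(y, [predict(row) for row in X_perm])
    return permuted - baseline                              # step 3: may be negative

rng = random.Random(0)

# Synthetic data (assumption): column 0 drives y, column 1 is pure noise.
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [3 * row[0] + rng.gauss(0, 1) for row in X]

# Hypothetical "fitted model": it has picked up a small spurious weight
# on the noise column, as an overfitted model might.
def predict(row):
    return 3 * row[0] + 0.05 * (row[1] - 5)

imp_signal = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
print("importance of column 0:", imp_signal)
print("importance of column 1:", imp_noise)
```

The informative column shows a large increase in MSE, while the noise column's value stays close to zero and can land on either side of it depending on the shuffle. A small negative value is therefore just random variation around zero.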

Hope this helps!

Please refer to the following link for a more detailed explanation:

https://stackoverflow.com/questions/27918320/what-does-negative-incmse-in-randomforest-package-mean