How to calculate variable importance p-values using the randomForest package in R

machine learning, p-value, r, random forest, statistical significance

For a classification project we are using the randomForest package in R, which wraps the Breiman Fortran random forest implementation, to assess the importance of each of our features. I would like to calculate p-values for each feature's importance statistics as described in the random forest documentation provided by Breiman.

… therefore we compute standard errors in the classical way, divide
the raw score by its standard error to get a z-score, and assign a
significance level to the z-score assuming normality.

The R randomForest package provides the mean decrease importance (MDI) metric for each class and overall (both classes combined), along with the standard deviation of that importance for each class and overall.

I don't understand how these values can be used to obtain a significance level for the variable importance: while the mean and standard deviation allow the construction of a normal distribution, there is no "observation" for the z-score calculation. Can someone clarify how to do this?

Best Answer

It is called a z-score mainly because it is mean/sd, but in practice it is useless for hypothesis testing: in some cases the most important attribute can have z ~ $10^{-3}$, while in others all z-scores are far larger than the mystical threshold of 3.
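To make the "mean/sd" reading concrete: there is no separate observation, because the statistic is the mean decrease in accuracy itself, divided by its standard error, which amounts to a test against zero. Here is a minimal sketch of that arithmetic, assuming the `importance` and `importanceSD` components of a classification forest fitted with `importance = TRUE`. The `iris` data, the seed, and the one-sided normal p-value are my own choices for illustration, not something the documentation prescribes.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility of the sketch
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# Raw (unscaled) permutation importance: mean decrease in accuracy over trees
raw <- rf$importance[, "MeanDecreaseAccuracy"]

# What the package stores as the "standard errors" of that measure
se <- rf$importanceSD[, "MeanDecreaseAccuracy"]

# z-score as described in the quoted documentation: raw score / standard error
z <- raw / se

# One-sided p-value under the normality assumption (illustrative choice)
p <- pnorm(z, lower.tail = FALSE)

print(data.frame(raw = raw, se = se, z = z, p = p))
```

The `z` column should match what `importance(rf, type = 1)` returns with its default `scale = TRUE`. Given the caveat above, I would not read these p-values as trustworthy significance levels.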

An approach that works, more or less, is to compare each attribute's importance with that of random dummy attributes added to the set. I have written a package, Boruta, that implements this idea.
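The dummy-attribute comparison can be sketched by hand with randomForest alone. This is a single-run simplification for illustration, not the actual Boruta algorithm. The `shadow_` prefix, the use of `iris`, and the "beat the best shadow" rule are assumptions made for this example.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
x <- iris[, 1:4]
y <- iris$Species

# Dummy ("shadow") attributes: each column permuted independently,
# which destroys any relationship with the response
shadow <- as.data.frame(lapply(x, sample))
names(shadow) <- paste0("shadow_", names(x))

# Fit one forest on the real and dummy attributes together
rf <- randomForest(cbind(x, shadow), y, importance = TRUE, ntree = 500)

# Scaled mean decrease in accuracy for every attribute
imp <- importance(rf, type = 1)[, 1]

# The best-scoring dummy attribute sets the reference level
is_shadow <- grepl("^shadow_", names(imp))
threshold <- max(imp[is_shadow])

# Real attributes whose importance exceeds that reference in this run
beats_shadow <- imp[!is_shadow] > threshold
print(beats_shadow)
```

A single run like this is noisy, since both the forest and the permutations are random. The Boruta package (for example `Boruta(Species ~ ., data = iris)`) repeats the comparison over many runs before deciding on each attribute.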