Solved – Are randomForest variable importance values comparable across same variables on different dates

data mining, feature selection, r, random forest

I have a data array $X$ of size $T\times N\times K$, where $T=1500$, $N=1500$ and $K=10$.

Physically, the 1st index $1,2,\ldots,T$ denotes days, while the 2nd index $1,2,\ldots,N$ represents locations, and the 3rd index $1,2,\ldots,K$ represents the $K$ features/variables measured at each location on each day.

The dependent variable is another array $Y$ which is of size $T\times N$.

Now I run randomForest on each date:

library(randomForest)

importanceValues=matrix(0, T, 10)

for (i in 1:T)

{
    y=Y[i, ]

    x1=X[i, ,1]
    x2=X[i, ,2]
    x3=X[i, ,3]
    x4=X[i, ,4]
    x5=X[i, ,5]
    x6=X[i, ,6]
    x7=X[i, ,7]
    x8=X[i, ,8]
    x9=X[i, ,9]
    x10=X[i, ,10]

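    # Spell out TRUE: T holds the number of days here, so importance=T would pass that number instead of a logical
    rf=randomForest(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, importance=TRUE, na.action=na.omit)

    # Column 2 of rf$importance is IncNodePurity
    importanceValues[i, ]=rf$importance[, 2]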

}

This gives me the variable importance values for every date. For example, on the last date we have:

> rf$importance
                          %IncMSE      IncNodePurity
x1                          311.0803     1113618.9
x2                         4627.7532     3415010.7
x3                         8527.4607     4916842.7
x4                         3507.1872     2919601.3
x5                         2982.0577     2907352.5
x6                         5673.6522     5247811.5
x7                         3893.7793     3618126.4
x8                          135.2311      248212.5
x9                         1759.8080     2334093.9
x10                         852.3294     1562279.1

My questions are:

  1. Which one is more useful, %IncMSE or IncNodePurity?
  2. How do I explain to a non-dataminer what "IncNodePurity" is?
  3. What's the unit of the "IncNodePurity" column? And can I compare these numbers across dates?
  4. On one date, e.g. 9/18/2008, most of the "IncNodePurity" numbers are much larger than those on another date, e.g. 6/1/2012.
    What can I say about the data sets on these two dates? (They are different observations of the same variables on different dates.)

Thank you!

Best Answer

Ad 1. %IncMSE is the result of an actual out-of-bag test, so in theory it is better than IncNodePurity, which is a by-product of training.
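
To make the "actual test" point concrete, here is a rough base-R sketch of the permutation idea behind this kind of measure. It is not what randomForest does internally: a linear model stands in for the forest, a simple held-out half stands in for the out-of-bag samples, and the data and all names (`d`, `fit`, `permImportance`) are made up for illustration.

```r
# Synthetic data (assumption): y depends on x1 only, x2 is pure noise
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 * x1 + rnorm(n)
d  <- data.frame(y, x1, x2)

# Held-out half plays the role of the out-of-bag samples (assumption)
train <- 1:100
test  <- 101:200

# lm is a stand-in for the forest, used only to keep the sketch dependency-free
fit <- lm(y ~ x1 + x2, data = d[train, ])
mse <- function(data) mean((data$y - predict(fit, data))^2)
baseMse <- mse(d[test, ])

# Shuffle one variable at a time in the test data and record the increase in MSE
permImportance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- d[test, ]
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - baseMse
})
```

Shuffling the informative variable should hurt the test error far more than shuffling the noise variable, and that increase is measured on data the model was not fitted on.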

Ad 3. & 4. To be honest, those values mean little on their own: they depend on how well the RF performs on the data for that particular date, and this varies enormously. If you want to compare anything, compare the rankings of the variables calculated on each date.
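
A minimal sketch of such a ranking comparison, using base R only. The matrix below is a synthetic stand-in for the `importanceValues` matrix from the question (rows are dates, columns are features); the per-date scale factors and the names `importanceRanks` and `rankAgreement` are assumptions made for illustration.

```r
# Importance profile taken from the table in the question (rounded)
base <- c(311, 4628, 8527, 3507, 2982, 5674, 3894, 135, 1760, 852)

# Synthetic per-date scale factors (assumption): same profile, very different magnitudes
dateScale <- c(1, 10, 0.5, 3, 100)
importanceValues <- outer(dateScale, base)
colnames(importanceValues) <- paste0("x", 1:10)

# Make the last date genuinely different by swapping the scores of x1 and x3
importanceValues[5, c("x1", "x3")] <- importanceValues[5, c("x3", "x1")]

# Rank features within each date (1 = most important); magnitudes are discarded
importanceRanks <- t(apply(importanceValues, 1, function(v) rank(-v)))

# Spearman correlation between the rankings of every pair of dates
rankAgreement <- cor(t(importanceValues), method = "spearman")
```

In this toy data the first four dates have identical rankings despite scores that differ by orders of magnitude, while the fifth date stands out in `rankAgreement` because its ordering actually changed.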

Ad 2. For the same reason, it is rather bogus to read more into either measure than a plain importance score.
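
For what it is worth, the randomForest documentation describes the quantity behind IncNodePurity for regression as the total decrease in residual sum of squares from splitting on a variable, averaged over all trees. The toy sketch below (base R, one hand-picked split, made-up numbers) shows that decrease for a single node, and why its magnitude follows the scale of $y$ rather than anything intrinsic to the variable.

```r
# Residual sum of squares around the node mean: the impurity used for regression
rss <- function(y) sum((y - mean(y))^2)

# Made-up node with six observations (assumption, for illustration only)
y <- c(1, 2, 3, 10, 11, 12)
x <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)

# One candidate split on x
left <- x < 0.5

# Purity gain = parent impurity minus the summed impurity of the two children
purityGain <- rss(y) - (rss(y[left]) + rss(y[!left]))   # 121.5

# Same data with y measured in units 10 times smaller: the gain grows 100-fold
yScaled <- 10 * y
purityGainScaled <- rss(yScaled) - (rss(yScaled[left]) + rss(yScaled[!left]))
```

The gain is in squared units of $y$ and grows with the number of observations in the node, which is one way the raw numbers can differ between dates without the variables themselves mattering more or less.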
