Solved – Per cent increase in MSE (%IncMSE) random forests importance measure: why is mean prediction error divided by standard deviation

Tags: importance, random forest

Random forest variable importance is calculated by one of two methods, of which permutation-based importance is generally considered the more reliable. In R's randomForest package, this returns a measure called %IncMSE (per cent increase in mean squared error) for regression. The calculation is explained clearly by @SorenHavelundWelling in this answer:

%IncMSE is the most robust and informative measure. It is the increase in MSE of predictions (estimated with out-of-bag CV) as a result of variable j being permuted (values randomly shuffled).

  1. Grow the regression forest. Compute the OOB MSE; name this mse0.
  2. For each variable j: permute the values of column j, then predict and compute OOB-mse(j).
  3. The %IncMSE of the j'th variable is (mse(j) - mse0)/mse0 * 100%.
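The recipe above can be sketched in R. This is a simplified illustration with made-up variable names; for clarity it permutes columns of a held-out set and recomputes MSE there, rather than the tree-wise OOB computation the package performs internally:

```r
library(randomForest)

# Toy data: x1 matters most, x2 a little, x3 not at all
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 + X$x2 + rnorm(n)

train <- 1:400; test <- 401:500
rf <- randomForest(X[train, ], y[train])

# Step 1: baseline MSE on held-out data (stand-in for OOB-mse0)
mse0 <- mean((predict(rf, X[test, ]) - y[test])^2)

# Steps 2-3: permute each column, recompute MSE, report % increase
pct_inc_mse <- sapply(names(X), function(j) {
  Xp <- X[test, ]
  Xp[[j]] <- sample(Xp[[j]])   # shuffle column j, breaking its link to y
  msej <- mean((predict(rf, Xp) - y[test])^2)
  100 * (msej - mse0) / mse0
})
round(pct_inc_mse, 1)
```

With this seed the irrelevant x3 should show a near-zero increase while x1 shows a large one.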

This matches my prior understanding as well. But the package documentation explains things this way (emphasis mine):

The first measure [%IncMSE] is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and *normalized by the standard deviation of the differences*. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

This normalization step does not really make sense to me, and the answer quoted above omits it as well. Why is it part of the calculation? And does the resulting value still have the meaning of a percentage change after normalization? It would seem not.

Best Answer

It seems analogous to the computation of an effect size. It reflects the mean increase in MSE the variable contributes, divided by a measure of its variability:

For each tree, we get a difference between two MSE values. Averaging over trees gives the mean difference between the two MSE values.

The standard deviation of the differences reflects the variation around the mean, a measure of residual error (cf. pooled standard deviation in ANOVA). Dividing the mean by this standard deviation gives an effect size (cf. Cohen's $d$ in ANOVA).
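Schematically, writing $d_t$ for the difference between the permuted and unpermuted OOB MSE in tree $t$ (my notation, not the package's), over $T$ trees the scaled importance is

$$\frac{\bar d}{s_d}, \qquad \bar d = \frac{1}{T}\sum_{t=1}^{T} d_t, \qquad s_d = \mathrm{sd}(d_1, \ldots, d_T),$$

i.e. a signal-to-noise ratio in the spirit of an effect size, not a raw percentage.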

Note that it is thus possible to obtain an MSE increase greater than 100% (just as it is possible to obtain a Cohen's $d$ effect size greater than 1); the scaled %IncMSE should not be interpreted as an $R^2$:

> set.seed(42)
> x <- rnorm(1000)
> y <- x + rnorm(1000, sd = .01)
> library(randomForest)
> rf <- randomForest(x ~ y, importance = TRUE)
> importance(rf)
   %IncMSE IncNodePurity
y 293.7449      1004.795
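The normalization can also be checked against the fitted object directly. The importance() function takes a scale argument, and the fitted forest stores the standard deviations in its importanceSD component; a quick sketch reusing the toy data above:

```r
library(randomForest)
set.seed(42)
x <- rnorm(1000)
y <- x + rnorm(1000, sd = .01)
rf <- randomForest(x ~ y, importance = TRUE)

# Raw mean difference in OOB MSE vs. the scaled (default) measure
raw    <- importance(rf, scale = FALSE)[, "%IncMSE"]
scaled <- importance(rf, scale = TRUE)[, "%IncMSE"]

# Dividing the raw mean by the stored SD should reproduce the scaled value
all.equal(unname(scaled), unname(raw / rf$importanceSD))
```

If this holds, the large %IncMSE shown above is the mean difference in units of its own variability, which is why it can exceed 100.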