Solved – mtry and unbalanced use of predictor variables in Random Forest

Tags: importance, r, random forest, sampling

I am working on Random Forest prediction, focusing on the importance of the predictor variables, and have a question about what mtry controls and about the actual usage of variables in the trees of a Random Forest in R (package randomForest).

I have a sample with over 1000 observations, a response vector classifying them into 2 classes (0 and 1), and 4 independent variables. For the experiment I have chosen the following parameters:

rftry1 <- randomForest(x, y, xtest, ytest, mtry = 4, ntree = 500, importance = TRUE,
                       keep.forest = TRUE, do.trace = TRUE, replace = TRUE,
                       keep.inbag = TRUE, proximity = TRUE)

I then check which variables have been actually used in the forest, and find a very skewed usage:

> varUsed(rftry1)

[1] 2436 1758 1988 1156

The usage drops sharply from the first to the fourth variable. Does this actually affect the prediction? Why does it happen like that?

I then reran the experiment with mtry=2; the usage counts decreased overall, but the relative pattern across variables stayed roughly the same as in the first run.

  MeanDecreaseAccuracy MeanDecreaseGini
1          0.006535855         2.177148
2          0.224706591       127.106268
3          0.006633846         5.020456
4          0.017522580        36.867821

I have noticed that how frequently a variable is used in the trees does not line up with the importance measures: the differences in usage counts are not enough to change the ranking of the variables by importance. So my two main questions are:

1) I guess I do not understand what exactly mtry controls (I have read that it is the number of variables sampled at each split). I have noticed that with mtry = 2 and mtry = 4 some of the variables were reused within a tree, while with mtry = 1 they were not, and the trees were much shorter. Do I understand it correctly when I say:

mtry=2 means that at every split, instead of deterministically choosing the single best variable to split on, we choose the best out of 2 randomly selected candidates? If I have 4 variables overall, and variable 1 was used for the very first split, what are my choices for the left daughter and right daughter nodes to split on?

2) Is mtry related to the unbalanced use of the variables for splitting? Why are the variables used like this, and does it bias the prediction outcome?

All my variables are factors: three have 2 levels and one has 3 levels.

Thank you for your answers!

Best Answer

The part of the overall random forest algorithm that uses mtry is the node-splitting step (adapted from The Elements of Statistical Learning; a toy code sketch of this step follows the list):

At each terminal node that is larger than the minimum node size:

1) Select mtry variables at random from the $p$ regressor variables,

2) From these mtry variables, pick the best variable and split point,

3) Split the node into two daughter nodes using the chosen variable and split point.
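To make the role of mtry in these three steps concrete, here is a minimal toy sketch for a single regression-tree node. This is not the actual randomForest internals, and the helper name best_split_at_node is invented for illustration; it only shows that just mtry randomly drawn variables compete for each split.

best_split_at_node <- function(X, y, mtry) {
  candidates <- sample(colnames(X), mtry)        # 1) draw mtry candidate variables at random
  best <- list(var = NA, cut = NA, sse = Inf)
  for (v in candidates) {
    xv   <- X[[v]]
    cuts <- sort(unique(xv))
    for (cut in cuts[-length(cuts)]) {           # try each observed value as a cut point
      left  <- y[xv <= cut]
      right <- y[xv >  cut]
      sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)   # 2) score the split
      if (sse < best$sse) best <- list(var = v, cut = cut, sse = sse)
    }
  }
  best                                           # 3) split the node on best$var at best$cut
}

set.seed(1)
X <- data.frame(a = runif(50), b = runif(50), c = runif(50), d = runif(50))
y_toy <- 2 * (X$a > 0.5) + rnorm(50, sd = 0.3)
best_split_at_node(X, y_toy, mtry = 2)   # only 2 of the 4 variables compete at this node

With 4 predictors and mtry = 2, even the genuinely informative variable is excluded from the candidate set at roughly half of the nodes, which is exactly why different variables end up being used different numbers of times across the forest.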

As an aside - you can use the tuneRF function in the randomForest package to select the "optimal" mtry for you, using the out-of-bag error estimate as the criterion.
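For instance, a minimal sketch, assuming x is your predictor data frame and y is your factor response as in your original call (the argument values here are illustrative, not prescriptive):

library(randomForest)
mtry_search <- tuneRF(x, y, mtryStart = 2, ntreeTry = 500,
                      stepFactor = 2, improve = 0.01, trace = TRUE, plot = FALSE)
mtry_search   # matrix of candidate mtry values with their out-of-bag error estimates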

The random selection of variables at each node splitting step is what makes it a random forest, as opposed to just a bagged estimator. Quoting from The Elements of Statistical Learning, p 588 in the second edition:

The idea in random forests ... is to improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.

There is no incremental increase in bias due to this. Of course, if the model itself is fundamentally biased, e.g., by leaving out important predictor variables, using random forests won't make the situation any better, but it won't make it worse either.

The unbalanced use of predictor variables just reflects the fact that some are less important than others, where important is used in a heuristic rather than a formal sense, and as a consequence, for some trees, may not be used often or at all. For example, think about what would happen if you had a variable that was barely significant on the full data set, but you then generated a lot of bootstrap datasets from the full data set and ran the regression again on each bootstrap dataset. You can imagine that the variable would be insignificant on a lot of those bootstrap datasets. Now compare to a variable that was extremely highly significant on the full dataset; it would likely be significant on almost all of the bootstrap datasets too. So if you counted up the fraction of regressions for which each variable was "selected" by being significant, you'd get an unbalanced count across variables. This is somewhat (but only somewhat) analogous to what happens inside the random forest, only the variable selection is based on "best at each split" rather than "p-value < 0.05" or some such.
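To see that analogy numerically, here is a small simulation (just repeated OLS fits on bootstrap samples, not a random forest; the variable names and coefficient values are invented for illustration):

set.seed(42)
n <- 200
x_strong <- rnorm(n)                         # strongly related to the response
x_weak   <- rnorm(n)                         # only marginally related
y_sim    <- 1.0 * x_strong + 0.15 * x_weak + rnorm(n)
dat <- data.frame(y_sim, x_strong, x_weak)

selected <- replicate(500, {
  boot  <- dat[sample(n, replace = TRUE), ]  # one bootstrap dataset
  pvals <- summary(lm(y_sim ~ x_strong + x_weak, data = boot))$coefficients[-1, 4]
  pvals < 0.05                               # "selected" = significant at the 5% level
})
rowMeans(selected)   # fraction of bootstrap fits in which each variable was selected

You should find the strong variable selected in nearly every refit and the weak one selected much less often, mirroring the unbalanced varUsed counts you see in the forest.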

EDIT (in response to a question by the OP): Note, however, that variable importance measures are not based solely on counts of how many times a variable is used in a split. Consequently, a variable can rank as more "important" (by the importance measures) while being used less often in splits than variables that rank as less "important". For example, consider the model:

$y_i = I(x_i > c) + 0.25\,z_i^2 + e_i$ (with $c = 0.5$ in the code below)

as implemented and estimated by the following R code:

library(randomForest)   # provides randomForest(), importance(), and varUsed()

x <- runif(500)                      # x matters only through the threshold at 0.5
z <- rnorm(500)                      # z matters across its whole range
y <- (x > 0.5) + z*z/4 + rnorm(500)
df <- data.frame(y = y, x = x, z = z,
                 junk1 = rnorm(500), junk2 = runif(500), junk3 = rnorm(500))
foo <- randomForest(y ~ x + z + junk1 + junk2 + junk3, mtry = 2, data = df)
importance(foo)
      IncNodePurity
x         187.38456
z         144.92088
junk1     102.41875
junk2      93.61086
junk3      92.59587

varUsed(foo)
[1] 16916 17445 16883 16434 16453

Here $x$ has higher importance, but $z$ is used more frequently in splits; $x$'s importance is high but in some sense very local, while $z$ is more important over the full range of $z$ values.
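If you want to see this distinction yourself, partial dependence plots are one way; a minimal sketch reusing foo and df from the code above:

par(mfrow = c(1, 2))
partialPlot(foo, pred.data = df, x.var = "x")   # roughly a single step near x = 0.5
partialPlot(foo, pred.data = df, x.var = "z")   # roughly U-shaped, reflecting z^2/4

The plot for x should look like one step near 0.5, while the plot for z should change across its entire range, matching the "local vs. global" description above.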

For a fuller discussion of random forests, see Chapter 15 of The Elements of Statistical Learning, which is available as a free PDF from the authors' website.