Machine Learning – Optimization Techniques for Random Forest Model

Tags: machine-learning, random-forest

I'm training a Random Forests regression model in R using the randomForest package. My total number of variables, $M$, is 30, and my training sample size $N$ has more than 10,000 observations.
The model generally gives reasonable results (compared to findings in the literature) in terms of variable importance and non-linear interactions between the variables, with decent overall performance (~80% variance explained; I don't think there is a threshold value that separates good from bad, though, so please correct me if I'm wrong).

I'm now tuning the model to achieve the highest possible performance by acting on two parameters: the number of trees, and the number of variables tried at each split.
Quoting from Breiman & Cutler's website:
"…it was shown that the forest error rate depends on two things:

  • The correlation between any two trees in the forest.
  • The strength of each individual tree in the forest.

Reducing $m$ [number of variables per tree, mtry in R] reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of $m$ – usually quite wide. Using the oob error rate a value of m in the range can quickly be found."

The model is not very sensitive to changes in the number of trees, stabilizing after a certain threshold (as usually happens with RF; see also "How many trees in a random forest?", Oshiro et al. 2012).
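For illustration, here is a minimal sketch of how this stabilization can be checked via the stored OOB error curve (on simulated data, since my real data set isn't shown; the data-generating process is invented for the example):

```r
library(randomForest)

set.seed(1234567)
N <- 2000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

rf <- randomForest(X, y, ntree = 500)
# rf$mse holds the OOB mean squared error after 1, 2, ..., ntree trees;
# plotting it shows where the forest stabilizes
plot(rf$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")
```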

Regarding the number of variables per tree, randomForest uses a default of $m=\frac{M}{3}$ for regression (which is consistent with the literature; see for instance "The Elements of Statistical Learning", Friedman et al. 2001, chap. 15). I used tuneRF to assess the sensitivity of the model to changes in $m$, but this function is quite unstable: if you run it several times, the results always differ, even with the seed fixed via set.seed(1234567).
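As a hedge against that instability, one option is to average the OOB error of several forests per candidate mtry, so the comparison is less dominated by the run-to-run noise of any single forest. A sketch on simulated data (the grid and repeat counts are arbitrary illustrative choices, not a recommendation):

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

# average the final OOB error of a few forests per candidate mtry
oob_mse <- sapply(1:6, function(m) {
  mean(replicate(3, tail(randomForest(X, y, mtry = m, ntree = 200)$mse, 1)))
})
which.min(oob_mse)  # mtry with the lowest averaged OOB error
```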

This was also discussed in another post on Stack Overflow, but my question is more theoretical. I tuned mtry manually and found the value that minimizes my mean of squared residuals and maximizes my % var explained, but it is quite far from the default $m=\frac{M}{3}$. Moreover, my model's results are sensitive to variations in mtry (i.e. even with the seed set to a specific value for reproducibility, the variable importance changes rather radically for every mtry value). Increasing mtry increases both the correlation between trees and the strength of each tree: how is it possible that by doubling the default mtry I get an overall improvement? Could this be biased by multicollinearity within the trees? In general, I'm struggling to find the "somewhere in between" Breiman and Cutler referred to.

Any idea from more experienced practitioners (this is my first RF) would be much appreciated.

Could somebody also explain why the relative variable importance can change so radically when mtry changes by even one unit? Since the model is an ensemble of randomized processes, shouldn't the results converge?
Thanks.

Best Answer

with decent overall performance (~80% variance explained; I don't think there is a threshold value that separates good from bad, though, so please correct me if I'm wrong).

I don't think you made a mistake, but check your code anyway. From my own experience and from previous questions on this forum, it is very common for new RF users to produce an over-confident out-of-bag cross validation (OOB-CV).

OOB is a cross validation regime, just like leave-one-out, 10-fold CV, or some nested regime. Any cross validation is computed by comparing predictions for observations that were not used in the training set against their observed values. OOB is convenient for random forest because you get it for free, with no extra run time: for any observation in the training set, there is a subset of trees that was trained independently of that observation. Explained variance is one metric for scoring how well a model performed under a given CV regime. You should choose or define the metric you find most useful; as a beginner, just stick to explained variance or mean squared error.

library(randomForest)

N = 2000
M = 6
X = data.frame(replicate(M, rnorm(N)))
y = with(X, X1*X2 + rnorm(N))
rf = randomForest(X, y)
print(rf)  # here's the printed OOB performance
## Call:
##  randomForest(x = X, y = y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 1.281954
##                     % Var explained: 32.5
tail(rf$rsq, 1)  # the OOB % var explained can also be fetched here
## [1] 0.325004
rf$predicted   # OOB CV predictions
predict(rf)    # the same OOB predictions
predict(rf, X) # careful: these are training-set predictions, never use them for validation

# "wow, what a performance" - sadly it is over-confident
1 - sum((predict(rf, X) - y)^2) / sum((y - mean(y))^2)
## [1] 0.869875
# this is the OOB performance as calculated within the package
1 - sum((predict(rf) - y)^2) / sum((y - mean(y))^2)

Secondly, if you tune by OOB-CV, the final OOB-CV of your chosen RF model is no longer an unbiased estimate of your model's performance. To proceed very thoroughly, you would need an outer repeated cross validation. However, if your model already explains 80% of the variance and you are only tuning mtry, I do not expect the OOB-CV to be way off; maybe 5% worse. [edit: by 5% I am not speaking of how much tweaking mtry will change the OOB-CV performance. I mean that OOB-CV may suggest e.g. an 83% performance, but this estimate is no longer entirely to be trusted; if you estimated the performance by a 10-outer-fold, 10-inner-fold, 10-repeat cross validation, you might find a performance of 78%.]
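A minimal sketch of such an outer cross validation wrapped around the mtry tuning, on simulated data (the fold count, grid, and tree counts are illustrative choices, not a recommendation):

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

# Outer 5-fold CV: tune mtry by OOB error on the training folds only, then
# score the chosen model on the held-out fold. The averaged held-out
# performance is a less optimistic estimate than the OOB of the final model.
folds <- sample(rep(1:5, length.out = N))
fold_r2 <- sapply(1:5, function(k) {
  tr <- folds != k
  oob <- sapply(1:6, function(m)
    tail(randomForest(X[tr, ], y[tr], mtry = m, ntree = 200)$mse, 1))
  best_m <- which.min(oob)                       # inner selection by OOB
  rf_k <- randomForest(X[tr, ], y[tr], mtry = best_m, ntree = 200)
  pred <- predict(rf_k, X[!tr, ])                # score on held-out fold
  1 - sum((pred - y[!tr])^2) / sum((y[!tr] - mean(y[!tr]))^2)
})
mean(fold_r2)  # outer-CV estimate of explained variance
```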

How is it possible that doubling the default mtry, I have an overall improvement [of model performance measured by OOB-CV]?

Yes, that is very possible. mtry values close to $M$ make the tree-growing process more greedy: each tree will use the one or few dominant variables first, explain as much of the target as possible, and split on the remaining variables far down the tree. High mtry values give trees with low bias. For training sets with a low noise component, this makes sense. Your training set may simply contain a small set of high-quality variables and some scraps; in that case a high mtry makes sense. If the training set instead contained a set of mostly redundant, noisy variables, a low mtry would ensure the model relies evenly on all of them.
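This can be illustrated with a small simulation (the data-generating processes are invented for the example): one dominant variable among pure noise favors a high mtry, while redundant noisy copies of the same signal favor a low one:

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))

oob <- function(x, y, m) tail(randomForest(x, y, mtry = m, ntree = 200)$mse, 1)

# one dominant variable plus pure noise variables: a greedy (high mtry)
# forest can always split on the dominant variable early
y_dom <- 3 * X$X1 + rnorm(N)
res_dom <- c(mtry1 = oob(X, y_dom, 1), mtry6 = oob(X, y_dom, 6))
res_dom

# several redundant noisy copies of the same signal: a low mtry spreads the
# splits over all copies and averages out their individual noise
Z <- data.frame(sapply(1:6, function(i) X$X1 + rnorm(N, sd = 2)))
y_red <- 3 * X$X1 + rnorm(N)
res_red <- c(mtry1 = oob(Z, y_red, 1), mtry6 = oob(Z, y_red, 6))
res_red
```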

Could somebody also explain why the relative variable importance can change so radically when mtry changes by even one unit?

First of all, random forest is non-deterministic, so variable importance will vary between runs. You can of course make the variable importance converge by growing a very large number of trees or by repeating the model training enough times. Make sure to use only permutation-based variable importance measures; loss-function-based importance (Gini, or squared residuals, called type = 2 in randomForest) is not really ever recommendable.

If mtry = 1, you force the model to use all variables equally and the variable importance tends to even out. If mtry = $M$, your model will use the dominant variables first and rely much more on them, so the variable importance will be distributed more unevenly.
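A quick illustration on simulated data (the data-generating process is invented for the example), comparing permutation-based importance at the two extremes of mtry:

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, 2 * X1 + X2 + rnorm(N))  # X1 dominant, X2 weaker, X3..X6 noise

# permutation-based importance (%IncMSE, type = 1) at mtry = 1 and mtry = M
imp1 <- importance(randomForest(X, y, mtry = 1, importance = TRUE), type = 1)
imp6 <- importance(randomForest(X, y, mtry = 6, importance = TRUE), type = 1)
# with mtry = 1 the importance is spread more evenly over the variables;
# with mtry = M it concentrates on the dominant variable X1
cbind(mtry1 = imp1[, 1], mtry6 = imp6[, 1])
```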

If you followed up with a sensitivity analysis, you would notice that the model predictions are more sensitive to the dominant variables when mtry is relatively high. Here is an example of how the model structure is affected by mtry. The figure is from the appendix of my thesis. In short, it is a random forest model predicting molecular solubility as a function of some standard molecular descriptors. Here I use forestFloor to visualize the model structure. Notice that when mtry = M = 12, the trained model primarily relies on the dominant variable SlogP, whereas when mtry = 1, the trained model relies almost evenly on SlogP, SMR, and Weight.

Other links:

A question about Dynamic Random Forest