Machine Learning – Optimization Techniques for Random Forest Model

Tags: machine-learning, random-forest

I'm training a Random Forests regression model in R using the randomForest package. My total number of variables, $M$, is 30, and my training sample size $N$ has more than 10,000 observations.
The model generally gives reasonable results (compared to findings in the literature) in terms of variable importance and non-linear interactions between the variables, with decent overall performance (~80% variance explained; I don't think there is a threshold value that separates good from bad, though, so please correct me if I'm wrong).

I'm now tuning the model to achieve the highest possible performance by acting on two parameters: the number of trees, and the number of variables tried at each split.
Quoting from Breiman & Cutler's website:
"…it was shown that the forest error rate depends on two things:

  • The correlation between any two trees in the forest.
  • The strength of each individual tree in the forest.

Reducing $m$ [number of variables per tree, mtry in R] reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of $m$ – usually quite wide. Using the oob error rate a value of m in the range can quickly be found."

The model is not very sensitive to changes in the number of trees, stabilizing after a certain threshold (as usually happens with RF; see also "How many trees in a random forest?", Oshiro et al. 2012).
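For illustration, here is a minimal sketch of how this stabilization can be checked via the stored OOB error curve (on simulated data, since my real data set isn't shown; the data-generating process is invented for the example):

```r
library(randomForest)

set.seed(1234567)
N <- 2000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

rf <- randomForest(X, y, ntree = 500)
# rf$mse holds the OOB mean squared error after 1, 2, ..., ntree trees;
# plotting it shows where the forest stabilizes
plot(rf$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")
```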

Regarding the number of variables per tree, randomForest uses a default of $m=\frac{M}{3}$ for regression (which is consistent with the literature; see for instance "The Elements of Statistical Learning", Friedman et al. 2001, chap. 15). I used tuneRF to assess the sensitivity of the model to changes in $m$, but this function is quite unstable: if you run it several times, the results always differ, even with the seed fixed via set.seed(1234567).
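As a hedge against that instability, one option is to average the OOB error of several forests per candidate mtry, so the comparison is less dominated by the run-to-run noise of any single forest. A sketch on simulated data (the grid and repeat counts are arbitrary illustrative choices, not a recommendation):

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

# average the final OOB error of a few forests per candidate mtry
oob_mse <- sapply(1:6, function(m) {
  mean(replicate(3, tail(randomForest(X, y, mtry = m, ntree = 200)$mse, 1)))
})
which.min(oob_mse)  # mtry with the lowest averaged OOB error
```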

This was also discussed in another post on Stack Overflow, but my question is more theoretical. I tuned mtry manually and found the value that minimizes my mean of squared residuals and maximizes my % var explained, but it is quite far from the default $m=\frac{M}{3}$. Moreover, my model's results are sensitive to variations in mtry (i.e. even with the seed set to a specific value for reproducibility, the variable importance changes rather radically for every mtry value). Increasing mtry increases both the correlation between trees and the strength of each tree: how is it possible that by doubling the default mtry I get an overall improvement? Could this be biased by multicollinearity within the trees? In general, I'm struggling to find the "somewhere in between" Breiman and Cutler referred to.

Any idea from more experienced practitioners (this is my first RF) would be much appreciated.

Could somebody also explain why the relative variable importance can change so radically when mtry changes by even one unit? Since the model is an ensemble of randomized processes, shouldn't the results converge?
Thanks.

Best Answer

with decent overall performance (~80% variance explained; I don't think there is a threshold value that separates good from bad, though, so please correct me if I'm wrong).

I don't think you made a mistake, but check your code anyway. From my own experience and from previous questions on this forum, it is very common for new RF users to produce an over-confident out-of-bag cross validation (OOB-CV).

OOB is a cross validation regime, just like leave-one-out, 10-fold CV, or some nested regime. Any cross validation is computed by comparing predictions for observations that were not used in the training set against their observed values. OOB is convenient for random forest because you get it for free, with no extra run time: for any observation in the training set, there is a subset of trees that was trained independently of that observation. Explained variance is one metric for scoring how well a model performed under a given CV regime. You should choose or define the metric you find most useful; as a beginner, just stick to explained variance or mean squared error.

library(randomForest)

N = 2000
M = 6
X = data.frame(replicate(M, rnorm(N)))
y = with(X, X1*X2 + rnorm(N))
rf = randomForest(X, y)
print(rf)  # here's the printed OOB performance
## Call:
##  randomForest(x = X, y = y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 1.281954
##                     % Var explained: 32.5
tail(rf$rsq, 1)  # the OOB % var explained can also be fetched here
## [1] 0.325004
rf$predicted   # OOB CV predictions
predict(rf)    # the same OOB predictions
predict(rf, X) # careful: these are training-set predictions, never use them for validation

# "wow, what a performance" - sadly it is over-confident
1 - sum((predict(rf, X) - y)^2) / sum((y - mean(y))^2)
## [1] 0.869875
# this is the OOB performance as calculated within the package
1 - sum((predict(rf) - y)^2) / sum((y - mean(y))^2)

Secondly, if you tune by OOB-CV, the final OOB-CV of your chosen RF model is no longer an unbiased estimate of your model's performance. To proceed very thoroughly, you would need an outer repeated cross validation. However, if your model already explains 80% of the variance and you are only tuning mtry, I do not expect the OOB-CV to be way off; maybe 5% worse. [edit: by 5% I am not speaking of how much tweaking mtry will change the OOB-CV performance. I mean that OOB-CV may suggest e.g. an 83% performance, but this estimate is no longer entirely to be trusted; if you estimated the performance by a 10-outer-fold, 10-inner-fold, 10-repeat cross validation, you might find a performance of 78%.]
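A minimal sketch of such an outer cross validation wrapped around the mtry tuning, on simulated data (the fold count, grid, and tree counts are illustrative choices, not a recommendation):

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, X1 * X2 + rnorm(N))

# Outer 5-fold CV: tune mtry by OOB error on the training folds only, then
# score the chosen model on the held-out fold. The averaged held-out
# performance is a less optimistic estimate than the OOB of the final model.
folds <- sample(rep(1:5, length.out = N))
fold_r2 <- sapply(1:5, function(k) {
  tr <- folds != k
  oob <- sapply(1:6, function(m)
    tail(randomForest(X[tr, ], y[tr], mtry = m, ntree = 200)$mse, 1))
  best_m <- which.min(oob)                       # inner selection by OOB
  rf_k <- randomForest(X[tr, ], y[tr], mtry = best_m, ntree = 200)
  pred <- predict(rf_k, X[!tr, ])                # score on held-out fold
  1 - sum((pred - y[!tr])^2) / sum((y[!tr] - mean(y[!tr]))^2)
})
mean(fold_r2)  # outer-CV estimate of explained variance
```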

How is it possible that doubling the default mtry, I have an overall improvement [of model performance measured by OOB-CV]?

Yes, that is very possible. mtry values close to $M$ make the tree-growing process more greedy: each tree will use the one or few dominant variables first, explain as much of the target as possible, and split on the remaining variables far down the tree. High mtry values give trees with low bias. For training sets with a low noise component, this makes sense. Your training set may simply contain a small set of high-quality variables and some scraps; in that case a high mtry makes sense. If the training set instead contained a set of mostly redundant, noisy variables, a low mtry would ensure the model relies evenly on all of them.
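This can be illustrated with a small simulation (the data-generating processes are invented for the example): one dominant variable among pure noise favors a high mtry, while redundant noisy copies of the same signal favor a low one:

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))

oob <- function(x, y, m) tail(randomForest(x, y, mtry = m, ntree = 200)$mse, 1)

# one dominant variable plus pure noise variables: a greedy (high mtry)
# forest can always split on the dominant variable early
y_dom <- 3 * X$X1 + rnorm(N)
res_dom <- c(mtry1 = oob(X, y_dom, 1), mtry6 = oob(X, y_dom, 6))
res_dom

# several redundant noisy copies of the same signal: a low mtry spreads the
# splits over all copies and averages out their individual noise
Z <- data.frame(sapply(1:6, function(i) X$X1 + rnorm(N, sd = 2)))
y_red <- 3 * X$X1 + rnorm(N)
res_red <- c(mtry1 = oob(Z, y_red, 1), mtry6 = oob(Z, y_red, 6))
res_red
```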

Could somebody also explain why the relative variable importance can change so radically when mtry changes by even one unit?

First of all, random forest is non-deterministic, so variable importance will vary between runs. You can of course make the variable importance converge by growing a very large number of trees or by repeating the model training enough times. Make sure to use only permutation-based variable importance measures; loss-function-based importance (Gini, or squared residuals, called type = 2 in randomForest) is not really ever recommendable.

If mtry = 1, you force the model to use all variables equally and the variable importance tends to even out. If mtry = $M$, your model will use the dominant variables first and rely much more on them, so the variable importance will be distributed more unevenly.
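A quick illustration on simulated data (the data-generating process is invented for the example), comparing permutation-based importance at the two extremes of mtry:

```r
library(randomForest)

set.seed(1234567)
N <- 1000
X <- data.frame(replicate(6, rnorm(N)))
y <- with(X, 2 * X1 + X2 + rnorm(N))  # X1 dominant, X2 weaker, X3..X6 noise

# permutation-based importance (%IncMSE, type = 1) at mtry = 1 and mtry = M
imp1 <- importance(randomForest(X, y, mtry = 1, importance = TRUE), type = 1)
imp6 <- importance(randomForest(X, y, mtry = 6, importance = TRUE), type = 1)
# with mtry = 1 the importance is spread more evenly over the variables;
# with mtry = M it concentrates on the dominant variable X1
cbind(mtry1 = imp1[, 1], mtry6 = imp6[, 1])
```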

If you followed up with a sensitivity analysis, you would notice that the model predictions are more sensitive to the dominant variables when mtry is relatively high. Here is an example of how the model structure is affected by mtry. The figure is from the appendix of my thesis. In short, it is a random forest model predicting molecular solubility as a function of some standard molecular descriptors. Here I use forestFloor to visualize the model structure. Notice that when mtry = M = 12, the trained model primarily relies on the dominant variable SlogP, whereas when mtry = 1, the trained model relies almost evenly on SlogP, SMR, and Weight.

Other links:

A question about Dynamic Random Forest