The part of the overall random forest algorithm that uses mtry
is (adapted from The Elements of Statistical Learning):
At each terminal node that is larger than the minimum node size,
1) Select mtry variables at random from the $p$ regressor variables,
2) From these mtry variables, pick the best variable and split point,
3) Split the node into two daughter nodes using the chosen variable and split point.
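The three steps above can be sketched as a single node-splitting function. This is a minimal Python/numpy sketch, not the randomForest implementation: `best_split` is a hypothetical name, and it scores candidate splits by the sum of squared errors in the two daughter nodes (one common criterion for regression trees).

```python
import numpy as np

def best_split(X, y, mtry, rng):
    """One node-splitting step of a random-forest regression tree (sketch):
    sample mtry candidate features, then search those candidates for the
    (feature, threshold) pair minimizing total SSE of the daughter nodes."""
    n, p = X.shape
    candidates = rng.choice(p, size=mtry, replace=False)   # step 1: random subset
    best = None                                            # (sse, feature, threshold)
    for j in candidates:                                   # step 2: best variable/split
        for t in np.unique(X[:, j])[:-1]:                  # thresholds at observed values
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best   # caller splits into two daughters with (feature, threshold): step 3

# Toy data where only feature 2 matters:
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = (X[:, 2] > 0.5).astype(float)
sse, j, t = best_split(X, y, mtry=5, rng=rng)
print(j)  # prints 2: with mtry = p every feature is a candidate, so the informative one wins
```

With `mtry < p`, feature 2 would only be chosen when it happens to land in the candidate set, which is exactly why less important variables still get used in some splits.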
As an aside, you can use the tuneRF function in the randomForest package to select the "optimal" mtry for you, using the out-of-bag error estimate as the criterion.
The random selection of variables at each node splitting step is what makes it a random forest, as opposed to just a bagged estimator. Quoting from The Elements of Statistical Learning, p 588 in the second edition:
The idea in random forests ... is to improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.
There is no incremental increase in bias due to this. Of course, if the model itself is fundamentally biased, e.g., by leaving out important predictor variables, using random forests won't make the situation any better, but it won't make it worse either.
The unbalanced use of predictor variables just reflects the fact that some are less important than others (where "important" is used in a heuristic rather than a formal sense) and, as a consequence, may not be used often, or at all, in some trees. For example, think about what would happen if you had a variable that was barely significant on the full data set, but you then generated a lot of bootstrap datasets from the full data set and ran the regression again on each bootstrap dataset. You can imagine that the variable would be insignificant on a lot of those bootstrap datasets. Now compare to a variable that was extremely highly significant on the full dataset; it would likely be significant on almost all of the bootstrap datasets too. So if you counted up the fraction of regressions for which each variable was "selected" by being significant, you'd get an unbalanced count across variables. This is somewhat (but only somewhat) analogous to what happens inside the random forest, only the variable selection is based on "best at each split" rather than "p-value < 0.05" or some such.
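The bootstrap analogy can be made concrete. The sketch below (plain Python/numpy, with an assumed cutoff of |t| > 1.96 as the "selected" criterion) counts how often a strong and a weak predictor come out significant across bootstrap resamples:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x_strong = rng.normal(size=n)
x_weak = rng.normal(size=n)
y = 1.0 * x_strong + 0.15 * x_weak + rng.normal(size=n)

def slope_t(x, y):
    """t-statistic of the slope in a simple linear regression of y on x."""
    x_c, y_c = x - x.mean(), y - y.mean()
    b = (x_c @ y_c) / (x_c @ x_c)
    resid = y_c - b * x_c
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (x_c @ x_c))
    return b / se

B = 500
hits = {"strong": 0, "weak": 0}
for _ in range(B):
    idx = rng.integers(0, n, size=n)                  # one bootstrap resample
    for name, x in (("strong", x_strong), ("weak", x_weak)):
        if abs(slope_t(x[idx], y[idx])) > 1.96:       # "selected" on this resample
            hits[name] += 1

# The strong predictor is selected on essentially every resample,
# the weak one only on some fraction of them: an unbalanced count.
print(hits["strong"] / B, hits["weak"] / B)
```

The random forest's "best at each split" selection produces the same kind of imbalance, just with a different selection rule.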
EDIT in response to a question by the OP: Note, however, that variable importance measures are not based solely on counts of how many times a variable is used in a split. Consequently, you can have "important" variables (as measured by "importance") that are used less often in splits than less "important" variables (as measured by "importance"). For example, consider the model:
$ y_i = I(x_i > c) + 0.25*z_i^2 + e_i$
as implemented and estimated by the following R code:
library(randomForest)  # for randomForest(), importance(), and varUsed()

x <- runif(500)
z <- rnorm(500)
y <- (x>0.5) + z*z/4 + rnorm(500)
df <- data.frame(list(y=y,x=x,z=z,junk1=rnorm(500),junk2=runif(500),junk3=rnorm(500)))
foo <- randomForest(y~x+z+junk1+junk2+junk3,mtry=2,data=df)
importance(foo)
IncNodePurity
x 187.38456
z 144.92088
junk1 102.41875
junk2 93.61086
junk3 92.59587
varUsed(foo)
[1] 16916 17445 16883 16434 16453
Here $x$ has higher importance, but $z$ is used more frequently in splits; $x$'s importance is high but in some sense very local, while $z$ is more important over the full range of $z$ values.
For a fuller discussion of random forests, see Chap. 15 of The Elements..., which the link above allows you to download as a pdf for free.
Perhaps what you are looking for is the Combining Multiple Models (CMM) approach developed by Domingos in the 90s. Details for using this with a bagged ensemble of C4.5 rules are described in his ICML paper:
Domingos, Pedro. "Knowledge Acquisition from Examples via Multiple Models." In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
The pseudocode in Table 1 of that paper is not specific to bagged C4.5, however.
To apply this to Random Forests, the key issue seems to be how to generate the random examples $\overrightarrow{x}$. Here is a notebook showing one way to do it with sklearn.
This has got me wondering what follow-up work has been done on CMM, and whether anyone has come up with a better way to generate $\overrightarrow{x}$. I've created a new question about it here.
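For a sense of what the CMM idea looks like with random forests, here is a hedged sklearn sketch. It is not the notebook linked above and not Domingos's exact procedure: I generate each $\overrightarrow{x}$ by sampling features independently from their empirical marginals, which is just one simple (assumed) choice, then label the generated examples with the ensemble and fit a single interpretable tree on the augmented data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = (X[:, 0] > 0.5) + X[:, 1] ** 2 / 4 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Generate random examples x_vec: one simple (assumed) scheme is to sample
# each feature independently from its empirical marginal in the training data.
m = 1000
X_new = np.column_stack([rng.choice(X[:, j], size=m) for j in range(X.shape[1])])
y_new = forest.predict(X_new)   # label generated examples with the ensemble

# Fit one comprehensible model on original + generated data (the CMM idea).
X_aug = np.vstack([X, X_new])
y_aug = np.concatenate([forest.predict(X), y_new])
single = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_aug, y_aug)

# How faithfully the single tree mimics the forest on the training inputs:
fidelity = single.score(X, forest.predict(X))
print(round(fidelity, 2))
```

Independent marginal sampling ignores correlations between features, which is presumably where a better generator for $\overrightarrow{x}$ would improve on this.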
Best Answer
Note that mtry is the number of variables randomly sampled as candidates at each split, and from these candidates the best one is chosen to perform the split. Thus the proportion you mention is not satisfied exactly: more important variables appear more frequently, and less important ones less frequently. So if a variable really is very important, there is a high probability that it will be picked in a tree, and you do not need any manual correction.
But sometimes (rarely) there is a need to force the presence of some variable in the regression, regardless of its apparent importance. As far as I know, the R package randomForest does not support this. However, if the variable is uncorrelated with the others, you can fit an ordinary regression with this variable as the single term and then run a random forest regression on the residuals of that ordinary regression. If you still want to control the probability of choosing prespecified variables, then modifying the source code and recompiling is your option.
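The residual trick can be sketched as follows. This is a Python/sklearn illustration of the idea rather than an R recipe; the variable names (`forced`, `X_other`) are hypothetical, and it assumes, as stated above, that the forced variable is uncorrelated with the rest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400
forced = rng.normal(size=n)              # the variable we insist on including
X_other = rng.normal(size=(n, 3))        # remaining predictors (uncorrelated with it)
y = 0.5 * forced + np.sin(X_other[:, 0]) + rng.normal(scale=0.2, size=n)

# Step 1: ordinary regression with the forced variable as the single term.
lin = LinearRegression().fit(forced.reshape(-1, 1), y)
resid = y - lin.predict(forced.reshape(-1, 1))

# Step 2: random forest on the residuals, using the remaining predictors.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_other, resid)

# Combined prediction: linear part + forest part.
y_hat = lin.predict(forced.reshape(-1, 1)) + rf.predict(X_other)
r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
print(round(r2, 2))
```

The forced variable's effect is guaranteed to be in the model through the linear term, while the forest is free to capture whatever structure remains in the residuals.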