The part of the overall random forest algorithm that uses mtry
is (adapted from The Elements of Statistical Learning):
At each terminal node that is larger than the minimal node size,
1) Select mtry variables at random from the $p$ regressor variables,
2) From these mtry variables, pick the best variable and split point,
3) Split the node into two daughter nodes using the chosen variable and split point.
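To make the three steps concrete, here is a minimal Python sketch of the node-splitting step for a regression forest (a toy illustration, not the randomForest implementation; the function name and the squared-error split criterion are my own choices):

```python
import numpy as np

def best_split(X, y, mtry, rng):
    """One node-splitting step of a random forest (regression case):
    consider only `mtry` randomly chosen variables, and pick the
    (variable, threshold) pair that minimises the sum of squared errors."""
    n, p = X.shape
    candidates = rng.choice(p, size=mtry, replace=False)  # step 1
    best = (None, None, np.inf)
    for j in candidates:                                  # step 2
        for t in np.unique(X[:, j])[:-1]:                 # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best  # step 3 would partition the node on best[0] at best[1]
```

Because `mtry < p`, a different random subset of variables is eligible at every node, which is exactly what decorrelates the trees.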
As an aside, you can use the tuneRF function in the randomForest package to select the "optimal" mtry for you, using the out-of-bag error estimate as the criterion.
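For readers working in Python rather than R, a rough analogue of what tuneRF does can be sketched with scikit-learn, where mtry is called max_features and an out-of-bag estimate is available via oob_score (the candidate grid below is an arbitrary choice of mine):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Try several values of mtry (max_features in scikit-learn) and keep
# the one with the lowest out-of-bag error.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

oob_err = {}
for mtry in (1, 3, 5, 10):
    rf = RandomForestRegressor(n_estimators=200, max_features=mtry,
                               oob_score=True, random_state=0)
    rf.fit(X, y)
    # oob_score_ is out-of-bag R^2 for regressors; 1 - R^2 is an error criterion
    oob_err[mtry] = 1.0 - rf.oob_score_

best_mtry = min(oob_err, key=oob_err.get)
```

This is only a grid search, whereas tuneRF walks the grid adaptively, but the criterion is the same.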
The random selection of variables at each node splitting step is what makes it a random forest, as opposed to just a bagged estimator. Quoting from The Elements of Statistical Learning, p 588 in the second edition:
The idea in random forests ... is to improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.
There is no incremental increase in bias due to this. Of course, if the model itself is fundamentally biased, e.g., by leaving out important predictor variables, using random forests won't make the situation any better, but it won't make it worse either.
The unbalanced use of predictor variables just reflects the fact that some are less important than others, where important is used in a heuristic rather than a formal sense, and as a consequence, for some trees, may not be used often or at all. For example, think about what would happen if you had a variable that was barely significant on the full data set, but you then generated a lot of bootstrap datasets from the full data set and ran the regression again on each bootstrap dataset. You can imagine that the variable would be insignificant on a lot of those bootstrap datasets. Now compare to a variable that was extremely highly significant on the full dataset; it would likely be significant on almost all of the bootstrap datasets too. So if you counted up the fraction of regressions for which each variable was "selected" by being significant, you'd get an unbalanced count across variables. This is somewhat (but only somewhat) analogous to what happens inside the random forest, only the variable selection is based on "best at each split" rather than "p-value < 0.05" or some such.
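The bootstrap thought experiment above is easy to simulate. In this sketch (coefficients, sample size, and the 0.05 cut-off are arbitrary choices of mine), a strong predictor is flagged significant on essentially every bootstrap resample, while a barely-significant one is flagged only sporadically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
x_strong = rng.normal(size=n)
x_weak = rng.normal(size=n)
y = 1.0 * x_strong + 0.15 * x_weak + rng.normal(size=n)

def significant(x, y):
    # p-value of the slope in a simple regression of y on x
    return stats.linregress(x, y).pvalue < 0.05

counts = {"strong": 0, "weak": 0}
for _ in range(200):
    idx = rng.integers(0, n, size=n)  # one bootstrap resample
    counts["strong"] += significant(x_strong[idx], y[idx])
    counts["weak"] += significant(x_weak[idx], y[idx])
```

Counting the fraction of resamples on which each variable was "selected" gives exactly the unbalanced tally described above.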
EDIT in response to a question by the OP: Note, however, that variable importance measures are not based solely on counts of how many times a variable is used in a split. Consequently, you can have "important" variables (as measured by "importance") that are used less often in splits than less "important" variables (as measured by "importance"). For example, consider the model:
$y_i = I(x_i > c) + 0.25 z_i^2 + e_i$
as implemented and estimated by the following R code:
library(randomForest)
x <- runif(500)
z <- rnorm(500)
y <- (x > 0.5) + z*z/4 + rnorm(500)
df <- data.frame(y=y, x=x, z=z, junk1=rnorm(500), junk2=runif(500), junk3=rnorm(500))
foo <- randomForest(y~x+z+junk1+junk2+junk3, mtry=2, data=df)
importance(foo)
IncNodePurity
x 187.38456
z 144.92088
junk1 102.41875
junk2 93.61086
junk3 92.59587
varUsed(foo)
[1] 16916 17445 16883 16434 16453
Here $x$ has higher importance, but $z$ is used more frequently in splits; $x$'s importance is high but in some sense very local, while $z$'s contribution is spread over the full range of $z$ values.
For a fuller discussion of random forests, see Chap. 15 of The Elements..., which the link above allows you to download as a pdf for free.
When working on "feature importance" generally, it is helpful to remember that a regularisation approach is often a good alternative. It will automatically "select the most important features" for the problem at hand.
Now, if we do not want to follow the regularisation route (usually within the context of regression), random forest classifiers and the notion of permutation tests naturally lend a solution to the feature importance of a group of variables. This has actually been asked before here: "Relative importance of a set of predictors in a random forests classification in R" a few years back. More rigorous approaches exist, such as Gregorutti et al.'s "Grouped variable importance with random forests and application to multivariate functional data analysis". Chakraborty & Pal's "Selecting Useful Groups of Features in a Connectionist Framework" looks into this task within the context of a Multi-Layer Perceptron. Going back to the Gregorutti et al. paper, their methodology is directly applicable to any kind of classification/regression algorithm: in short, we use a randomly permuted version of the group in each out-of-bag sample used during training.
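As an illustration of the grouped permutation idea (a simplified sketch, not Gregorutti et al.'s exact procedure: I permute on a held-out set rather than on the out-of-bag samples), one can permute all columns of a group jointly and record the increase in prediction error:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# group A: two correlated informative columns; group B: two noise columns
a = rng.normal(size=n)
group_a = np.c_[a, a + 0.1 * rng.normal(size=n)]
group_b = rng.normal(size=(n, 2))
X = np.c_[group_a, group_b]
y = group_a[:, 0] + group_a[:, 1] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base = mean_squared_error(y_te, rf.predict(X_te))

def group_importance(cols, n_rep=20):
    """Permute all columns in `cols` *together* and report the mean
    increase in test MSE: the grouped analogue of permutation importance."""
    increases = []
    for _ in range(n_rep):
        Xp = X_te.copy()
        perm = rng.permutation(len(X_te))
        Xp[:, cols] = Xp[perm][:, cols]  # joint permutation preserves within-group structure
        increases.append(mean_squared_error(y_te, rf.predict(Xp)) - base)
    return float(np.mean(increases))

imp_a = group_importance([0, 1])
imp_b = group_importance([2, 3])
```

Permuting the group as a block is the important detail: permuting each column separately would break the within-group correlations and misstate the group's joint contribution.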
Having stated the above, while permutation tests are ultimately a heuristic, what has been solved rigorously in the past is the penalisation of dummy variables within the context of regularised regression. The answer to that question is Group-LASSO, Group-LARS and the Group-Garrote. Seminal papers in that line of work are Yuan and Lin's "Model selection and estimation in regression with grouped variables" (2006) and Meier et al.'s "The group lasso for logistic regression" (2008). This methodology allows us to work in situations where "each factor may have several levels and can be expressed through a group of dummy variables" (Y&L 2006). The effect is that "the group lasso encourages sparsity at the factor level" (Y&L 2006). Without going into excessive detail, the basic idea is that the standard $l_1$ penalty is replaced by a sum of group-wise norms $\|\beta_j\|_{K_j}$ defined through positive definite matrices $K_{j}$, $j = 1, \dots, J$, where $J$ is the number of groups we examine. CV has a few good threads regarding Group-Lasso here, here and here if you want to pursue this further.
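To give a feel for the "sparsity at the factor level", here is a small self-contained Python sketch of the group lasso fitted by proximal gradient descent (taking $K_j = I$, i.e., the plain Euclidean norm per group; the step size rule and iteration count are my own simple choices, not a production solver):

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Minimise ||y - Xb||^2 / (2n) + lam * sum_j ||b_j||_2 by proximal
    gradient descent. `groups` is a list of index arrays; the group-wise
    soft-thresholding prox zeroes out whole groups at once, which is the
    'sparsity at the factor level' of Yuan & Lin (2006)."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = b - step * grad
        for g in groups:  # prox: soft-threshold each group's norm
            norm = np.linalg.norm(b[g])
            b[g] = 0.0 if norm <= step * lam else (1 - step * lam / norm) * b[g]
    return b
```

If the columns of one group are the dummy variables of a single factor, the prox step either keeps the whole factor or drops it entirely, which is precisely why adding up per-dummy importances is unnecessary here.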
[Because we mention Python specifically: I have not used Python's pyglmnet package, but it appears to include grouped lasso regularisation.]
All in all, it does not make sense to simply "add up" variable importances from individual dummy variables, because doing so would not capture the association between them and could lead to potentially meaningless results. That said, both group-penalised methods and permutation variable importance methods give a coherent and (especially in the case of permutation importance procedures) generally applicable framework to do so.
Finally, to state the obvious: do not bin continuous data. It is bad practice; there is an excellent thread on this matter here (and here). The fact that we observe spurious results after the discretisation of a continuous variable, like age, is not surprising. Frank Harrell has also written extensively on the problems caused by categorising continuous variables.
Best Answer
Random forests for classification might use two kinds of variable importance. See the original description of RF here.
This is plain wrong. The Gini impurity is built using only the proportions of the target/dependent variable when the data are split by a test involving a numerical or nominal independent variable. Note that the independent variable plays a role only in building the split test; the computation of the Gini index is based only on counts of the dependent variable after the split. Of course, the Gini impurity index at each node is then used to compute the Gini importance.
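To see that only the counts of the dependent variable enter the impurity, here is a minimal Python sketch of the Gini impurity and the impurity decrease of a split (function names are mine):

```python
import numpy as np

def gini(counts):
    """Gini impurity from class counts of the *dependent* variable:
    1 - sum_k p_k^2.  The independent variable never appears here; it
    only determines which rows land in each daughter node."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Impurity decrease of one split: parent impurity minus the
    size-weighted impurities of the daughters.  Accumulating this over
    all splits made on a variable gives its Gini importance."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    return gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)
```

For example, a node with class counts [5, 5] has impurity 0.5, and a split producing pure daughters [5, 0] and [0, 5] yields the maximal decrease of 0.5, regardless of which independent variable defined the split.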
I do not know if this counts, but my personal experiments revealed that there are no big differences between Gini variable importance and permutation variable importance, and I usually preferred the former.
The second problem is the imbalance between the samples labelled 1 and 0. I think this might play a role in variable importance, but to be honest I would verify whether this is the case. Thus I would repeat the variable importance computation many times on various samples whose class proportions vary gradually from a 0.5 ratio to the actual ratio. I would not be surprised to find the variable importance stable regardless of the proportion.
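The experiment sketched above might look as follows in Python (a sketch with scikit-learn on synthetic data; the sample sizes and the particular class ratios are arbitrary choices of mine):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Refit the forest on subsamples whose class ratio moves from 50/50
# toward a skewed ratio, and watch whether the importance ranking holds.
X, y = make_classification(n_samples=4000, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
rng = np.random.default_rng(0)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

rankings = []
for pos_frac in (0.5, 0.3, 0.1):                    # gradually skew the ratio
    n_pos = int(1000 * pos_frac)
    idx = np.r_[rng.choice(pos, n_pos, replace=False),
                rng.choice(neg, 1000 - n_pos, replace=False)]
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[idx], y[idx])
    rankings.append(np.argsort(rf.feature_importances_)[::-1])
```

If the top of each ranking stays the same across ratios, the importance is stable with respect to class imbalance, which is the outcome I would expect.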
[later edit]
It took me some time to work through the document provided by @Donbeo. I agree with the results from that paper, and I hope to experiment further with it myself. The only thing I do not like about that study is that it does not state how many trees were used or how varying this parameter would affect the results. The single note in that regard is that the number of trees affects the scaled version of the permutation test.