Solved – Feature subset selection by stepwise regression for a random forest model

information criteria, nonparametric, random forest, regression, stepwise regression

I would like to build a random forest model for regression. I have an abundance of potential features, and I expect only some of them to have a significant impact on the target variable. In addition, some of the features may be strongly correlated.

By trial and error, I have found a subset of features which seems to work well (in terms of mean squared error on the test set), but I would like to see if I can do better. I was thinking of some sort of automatic feature subset selection, such as stepwise regression. However, I have two questions:

  1. Since random forest models already select features, is it possible to gain much by such a method?
  2. What would be a suitable information criterion for a random forest model? Since it's nonparametric, the standard information criteria (such as AIC or BIC) that one would normally use for stepwise regression don't apply.

Best Answer

Since random forest models already select features, is it possible to gain much by such a method?

Yes and no. By selecting only a subset of features and creating synthetic variables, you can help or accelerate the convergence of the trees, but you won't necessarily improve the model: a synthetic variable, being a combination of one or more variables and one or more split rules, is nothing more than a node in a decision tree, so the tree will find it on its own if it is relevant.

Still, the idea is a good one. A widely used technique is to look at a simple decision tree, take the conditions of its first nodes, and use them to create synthetic features that are re-injected into a logistic regression or a stepwise regression, as in the sketch below.

*(Figure: example of a decision tree graph)*
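For concreteness, here is a minimal sketch of that idea with scikit-learn. The data, the depth-2 tree, and the helper names `split_conditions` and `make_synthetic` are illustrative assumptions, not part of the original answer: we fit a shallow decision tree, turn the conditions of its first nodes into binary synthetic features, and re-inject them into a logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, _tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data (sizes are arbitrary, for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Fit a shallow decision tree and read off the first split conditions
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

def split_conditions(t):
    """Return (feature, threshold) pairs for the internal nodes of a tree_ structure."""
    return [(f, thr) for f, thr in zip(t.feature, t.threshold)
            if f != _tree.TREE_UNDEFINED]

conditions = split_conditions(tree.tree_)

# 2. Turn each condition "x_f <= threshold" into a binary synthetic feature
def make_synthetic(X, conditions):
    return np.column_stack([(X[:, f] <= thr).astype(float) for f, thr in conditions])

Z_train = np.hstack([X_train, make_synthetic(X_train, conditions)])
Z_test = np.hstack([X_test, make_synthetic(X_test, conditions)])

# 3. Re-inject the synthetic features into a logistic regression
logreg = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print("accuracy with synthetic features:", logreg.score(Z_test, y_test))
```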

What would be a suitable information criterion for a random forest model? Since it's nonparametric, the standard information criteria (such as AIC or BIC) that one would normally use for stepwise regression don't apply.

The objective of each split is to find the division, or more precisely the variable and the split rule, that produces the strongest decrease in the heterogeneity of the left child node $\kappa_l$ and the right child node $\kappa_r$. When $Y$ is a qualitative variable, several heterogeneity functions can be defined for a node: a criterion based on the notion of entropy, or one based on the Gini concentration (there is also the CHAID criterion, which is based on the $\chi^2$ statistical test). In practice the choice of criterion has little influence, and Gini is usually chosen by default.

For a node $\kappa$, let $p_\kappa^l$ be the proportion of elements of class $l$ in the node, and $m$ the number of classes.

\begin{align*} \textit{Entropy}: \quad & S_\kappa = - \sum_{l=1}^{m} p_\kappa^l \log(p_\kappa^l) \\ \textit{Gini}: \quad & G_\kappa = \sum_{l=1}^{m} p_\kappa^l (1 - p_\kappa^l) = 1 - \sum_{l=1}^{m} (p_{\kappa}^{l})^2 \end{align*}
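As a quick numerical illustration of these two impurity measures (a standalone sketch; the example class proportions are made up), they can be computed directly from the class proportions of a node:

```python
import numpy as np

def entropy(p):
    """Entropy S_k = -sum_l p_k^l * log(p_k^l), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(p):
    """Gini impurity G_k = 1 - sum_l (p_k^l)^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Node with class proportions (0.7, 0.2, 0.1), i.e. m = 3 classes
proportions = [0.7, 0.2, 0.1]
print(entropy(proportions))   # ≈ 0.802
print(gini(proportions))      # 1 - (0.49 + 0.04 + 0.01) = 0.46
```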

As you mentioned earlier, we cannot directly use the Akaike information criterion or the Bayesian information criterion. Nevertheless, it is possible to easily apply a backward stepwise selection.

In the case of random forests, a variable-selection method can be based on the importance score of the variables (the ability of a variable to predict $Y$). We then use a top-down (or backward) strategy in which the least important variables, as defined by the importance criterion, are removed step by step. At each stage of the algorithm we compute the prediction error, and the subset finally chosen is the one that minimizes it.

The algorithm can be summed up as follows (a code sketch is given after the list):

1. Construct a random forest and calculate the error
2. Calculate the importance measure of each variable
3. Eliminate the least important variable
4. Repeat steps 1 to 3 until all variables have been eliminated
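Here is a minimal sketch of this backward elimination, assuming a scikit-learn `RandomForestRegressor`, its built-in `feature_importances_` as the importance criterion, and a held-out test set for the prediction error (the data and parameters are placeholders):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data; replace with your own X, y
X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

features = list(range(X.shape[1]))   # indices of the remaining variables
history = []                         # (feature subset, test MSE) at each step

while features:
    # Step 1: construct a random forest on the current subset and compute its error
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_train[:, features], y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test[:, features]))
    history.append((list(features), mse))

    # Step 2: importance of each remaining variable
    importances = rf.feature_importances_
    # Step 3: eliminate the least important variable
    least = features[int(np.argmin(importances))]
    features.remove(least)
    # Step 4: the loop repeats until all variables have been eliminated

# Keep the subset that minimizes the prediction error
best_subset, best_mse = min(history, key=lambda t: t[1])
print("best subset:", best_subset, "test MSE:", round(best_mse, 2))
```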