Solved – What does it mean that stepwise, backward and forward selection methods are “path dependent”

feature selection, intuition, optimization, regression, stepwise regression

In many papers I read that stepwise, backward and forward selection methods are "path dependent". What does this mean? Could anyone give me a practical example to clarify the underlying concept? Is it related to the fact that these methods are local search techniques?

Best Answer

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

In detail (a code sketch follows the list):

  • Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
  • For $k = 0, \dots, p - 1:$ Consider all $p - k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor. Choose the best among these $p - k$ models, and call it $\mathcal{M}_{k + 1}$. Here best is defined as having the smallest $RSS$, or equivalently largest $R^2$.
  • Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-validated prediction error, $AIC$, $BIC$, or adjusted $R^2$.
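To make the greedy nature of the procedure concrete, here is a minimal sketch in Python. It is not from any particular library: the helper names `rss` and `forward_stepwise` are made up for illustration, and it assumes a NumPy design matrix `X` (one column per predictor) and a response vector `y`. It returns the whole path $\mathcal{M}_0, \dots, \mathcal{M}_p$; picking a single model from that path via cross-validation, $AIC$, etc. is the separate final step described in the last bullet.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns (plus intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the sequence of models M_0, ..., M_p chosen by forward selection."""
    p = X.shape[1]
    selected, remaining = [], set(range(p))
    path = [tuple(selected)]                      # M_0: the null model
    for _ in range(p):
        # Greedy step: among the remaining predictors, add the one
        # whose inclusion yields the smallest RSS (equivalently, largest R^2).
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(tuple(selected))              # M_{k+1}
    return path
```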

It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the $p$ predictors. This is because the variable selected in round 1 (i.e., for $k = 0$) will definitely be in the final model, even if it turns out at a later stage that the best $k$-dimensional model, for some $k > 1$, does not contain that particular variable. This is precisely what "path dependent" means: each greedy choice constrains every subsequent model, so the procedure follows a single path through the space of subsets rather than exploring all of it.

In slightly different words, it might happen that a variable gives the largest reduction in $RSS$ relative to the null model, but that, once combinations of, say, three variables are considered, some combination of three predictors makes the variable chosen in round 1 redundant. Forward selection can never undo that first choice, as the simulation below demonstrates.
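To see this failure mode numerically, here is a toy simulation of my own construction (not from the original answer), reusing the hypothetical `rss` and `forward_stepwise` helpers sketched above. The variable `x1` is built as a noisy proxy for `x2 + x3`, so on its own it wins round 1, yet the best two-variable model is $\{x_2, x_3\}$, which the forward path can never reach because `x1` is already locked in.

```python
from itertools import combinations

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * (x2 + x3) + 0.2 * rng.normal(size=n)   # noisy proxy for x2 + x3
y  = x2 + x3 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])                  # columns 0, 1, 2

print(forward_stepwise(X, y))
# e.g. [(), (0,), (0, 1), (0, 1, 2)]: x1 (column 0) is fixed from round 1 on

best_pair = min(combinations(range(3), 2), key=lambda c: rss(X, y, list(c)))
print(best_pair)   # (1, 2): the best 2-variable model uses x2 and x3, not x1
```

Exhaustive best-subset search over all $2^p$ models would find $\{x_2, x_3\}$ here, but at exponential cost; forward selection trades that guarantee for roughly $O(p^2)$ model fits, which is exactly why its answer depends on the path taken.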