Solved – What does it mean that stepwise, backward and forward selection methods are “path dependent”

feature selection, intuition, optimization, regression, stepwise regression

In many papers I read that stepwise, backward and forward selection methods are "path dependent". What does this mean? Could anyone give me a practical example to clarify the underlying concept? Is it related to the fact that these methods are local search techniques?

Best Answer

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

In detail (a code sketch follows the list):

  • Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
  • For $k = 0, \dots, p - 1:$ Consider all $p - k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor. Choose the best among these $p - k$ models, and call it $\mathcal{M}_{k + 1}$. Here best is defined as having the smallest $RSS$, or equivalently largest $R^2$.
  • Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-validated prediction error, $AIC$, $BIC$, or adjusted $R^2$.
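To make the greedy nature of the procedure concrete, here is a minimal sketch in Python. It is not from any particular library: the helper names `rss` and `forward_stepwise` are made up for illustration, and it assumes a NumPy design matrix `X` (one column per predictor) and a response vector `y`. It returns the whole path $\mathcal{M}_0, \dots, \mathcal{M}_p$; picking a single model from that path via cross-validation, $AIC$, etc. is the separate final step described in the last bullet.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns (plus intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the sequence of models M_0, ..., M_p chosen by forward selection."""
    p = X.shape[1]
    selected, remaining = [], set(range(p))
    path = [tuple(selected)]                      # M_0: the null model
    for _ in range(p):
        # Greedy step: among the remaining predictors, add the one
        # whose inclusion yields the smallest RSS (equivalently, largest R^2).
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(tuple(selected))              # M_{k+1}
    return path
```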

It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the $p$ predictors. This is because the variable selected in round 1 (i.e., for $k = 0$) will definitely be in the final model, even if it turns out at a later stage that the best $k$-dimensional model, for some $k > 1$, does not contain that particular variable. This is precisely what "path dependent" means: each greedy choice constrains every subsequent model, so the procedure follows a single path through the space of subsets rather than exploring all of it.

In slightly different words, it might happen that a variable gives the largest reduction in $RSS$ relative to the null model, but that, once combinations of, say, three variables are considered, some combination of three predictors makes the variable chosen in round 1 redundant. Forward selection can never undo that first choice, as the simulation below demonstrates.
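To see this failure mode numerically, here is a toy simulation of my own construction (not from the original answer), reusing the hypothetical `rss` and `forward_stepwise` helpers sketched above. The variable `x1` is built as a noisy proxy for `x2 + x3`, so on its own it wins round 1, yet the best two-variable model is $\{x_2, x_3\}$, which the forward path can never reach because `x1` is already locked in.

```python
from itertools import combinations

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * (x2 + x3) + 0.2 * rng.normal(size=n)   # noisy proxy for x2 + x3
y  = x2 + x3 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])                  # columns 0, 1, 2

print(forward_stepwise(X, y))
# e.g. [(), (0,), (0, 1), (0, 1, 2)]: x1 (column 0) is fixed from round 1 on

best_pair = min(combinations(range(3), 2), key=lambda c: rss(X, y, list(c)))
print(best_pair)   # (1, 2): the best 2-variable model uses x2 and x3, not x1
```

Exhaustive best-subset search over all $2^p$ models would find $\{x_2, x_3\}$ here, but at exponential cost; forward selection trades that guarantee for roughly $O(p^2)$ model fits, which is exactly why its answer depends on the path taken.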