Feature Selection – Starting Model for Forward-Backward Model Selection in R

aic, feature-selection, forward-backward, modeling, r

I am trying to understand the logic behind forward-backward selection (even though I know there are better methods for model selection). In forward selection, the process starts with an empty model and variables are added sequentially. In backward selection, the process starts with the full model and variables are removed sequentially.

Question: With which model does forward-backward selection start?

Is it the full model? The empty model? Something in between? Wikipedia and Hastie et al. (2009), The Elements of Statistical Learning, p. 60, explain the method, but I wasn't able to find anything about the starting model. For my analysis I am using the stepAIC function from the R package MASS.

UPDATE:

Below you can find an example in R. The stepAIC function automatically prints each step of the selection process to the console, and it appears that the selection starts with the full model. However, based on jjet's answer, I am not sure whether I have done something wrong.

# Example data
N <- 1000000
y <- rnorm(N)
x1 <- y + rnorm(N)
x2 <- y + rnorm(N)
x3 <- y + rnorm(N)
x4 <- rnorm(N)
x5 <- rnorm(N)
x6 <- rnorm(N)
data <- data.frame(y, x1, x2, x3, x4, x5, x6)

# Stepwise selection (starting from the model passed in)
library(MASS)
mod <- lm(y ~ ., data = data)
stepAIC(mod, direction = "both")
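For comparison, here is a sketch of running the same stepwise search starting from the empty (intercept-only) model instead; the scope argument tells stepAIC the lower and upper limits of the search. N is reduced here only so the example runs quickly.

```r
library(MASS)
set.seed(1)

# Same data-generating process as above, with a smaller N for speed
N <- 10000
y <- rnorm(N)
x1 <- y + rnorm(N)
x2 <- y + rnorm(N)
x3 <- y + rnorm(N)
x4 <- rnorm(N)
x5 <- rnorm(N)
x6 <- rnorm(N)
data <- data.frame(y, x1, x2, x3, x4, x5, x6)

empty <- lm(y ~ 1, data = data)   # intercept-only starting model
full  <- lm(y ~ ., data = data)   # upper limit of the search

# Stepwise ("both") search starting from the empty model
mod_both <- stepAIC(empty, direction = "both",
                    scope = list(lower = empty, upper = full))
```

With this setup the search begins at the intercept-only model; x1, x2, and x3 end up in the selected model because they are the only variables related to y.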

Best Answer

I believe "forward-backward" selection is another name for "forward-stepwise" selection, and direction = "both" is the default used by stepAIC. In this procedure, you start with an empty model and build up sequentially, just as in forward selection. The only caveat is that every time you add a new variable, $X_{new}$, you have to check whether any of the variables already in the model should be dropped after $X_{new}$ is included. In this way, the search can move "nonlinearly" through the space of candidate models.

-------- EDIT --------

The following R code illustrates the difference between the three selection strategies:

library(MASS)
set.seed(1)

N <- 200000
y <- rnorm(N)
x1 <- y + rnorm(N)
x2 <- y + rnorm(N)
x3 <- y + rnorm(N)
x4 <- rnorm(N)
x5 <- rnorm(N)
x6 <- x1 + x2 + x3 + rnorm(N)
data <- data.frame(y, x1, x2, x3, x4, x5, x6)

fit1 <- lm(y ~ ., data = data)
fit2 <- lm(y ~ 1, data = data)
stepAIC(fit1, direction = "backward")
stepAIC(fit2, direction = "forward", scope = list(upper = fit1, lower = fit2))
stepAIC(fit2, direction = "both", scope = list(upper = fit1, lower = fit2))

I've modified your example slightly in this code. First, I set a seed so that you can reproduce the data I used, and I made N smaller so the algorithm runs faster. I kept all your variables the same except x6, which is now the single most predictive variable for y; this makes it the first variable chosen in forward and forward-stepwise selection. But once x1, x2, and x3 enter the model, x6 becomes conditionally independent of y and should be excluded. You'll see that forward-stepwise does exactly this: it starts with x6, proceeds to include x1, x2, and x3, then goes back, drops x6, and terminates. If you use pure forward selection, x6 stays in the model because the algorithm never revisits earlier choices with this sort of multicollinearity check.
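If you prefer to inspect the search path programmatically rather than reading the console trace, the object returned by stepAIC records the step history in its anova component. A sketch reusing the same data-generating process (N reduced so the example runs quickly):

```r
library(MASS)
set.seed(1)

# Same setup as above, smaller N so the example runs quickly
N <- 50000
y <- rnorm(N)
x1 <- y + rnorm(N)
x2 <- y + rnorm(N)
x3 <- y + rnorm(N)
x4 <- rnorm(N)
x5 <- rnorm(N)
x6 <- x1 + x2 + x3 + rnorm(N)
data <- data.frame(y, x1, x2, x3, x4, x5, x6)

full  <- lm(y ~ ., data = data)
empty <- lm(y ~ 1, data = data)

fit_both <- stepAIC(empty, direction = "both", trace = FALSE,
                    scope = list(upper = full, lower = empty))

# The Step column records each addition ("+ x6") and removal ("- x6")
print(fit_both$anova$Step)
```

You should see "+ x6" early in the path and "- x6" near the end, confirming that the stepwise search first adds x6 and later drops it once x1, x2, and x3 have entered.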
