Solved – What are the advantages of stepwise regression

feature selection, regression, stepwise regression

I am experimenting with stepwise regression for the sake of diversity in my approach to the problem. So, I have 2 questions:

  1. What are the advantages of stepwise regression? What are its specific strengths?

  2. What do you think about a hybrid approach, where you use stepwise regression to select features and then apply regular regression, taking all the selected features together?

Best Answer

The primary advantage of stepwise regression is that it's computationally efficient. However, its performance is generally worse than alternative methods. The problem is that it's too greedy. By making a hard selection on the next regressor and 'freezing' its weight, it makes choices that are locally optimal at each step but suboptimal in general. And it can't go back to revise its past choices.
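To make the greedy, weight-freezing behaviour concrete, here is a minimal sketch in Python/NumPy (not from any particular package; the function name `greedy_stepwise` and the stopping rule are my own choices): each step picks the column most correlated with the current residual, fits that single weight against the residual, and never revisits it.

```python
import numpy as np

def greedy_stepwise(X, y, n_nonzero):
    """Matching-pursuit-style stepwise selection: pick the column most
    correlated with the current residual, fit its weight once, then freeze it."""
    w = np.zeros(X.shape[1])
    residual = y.copy()
    active = []
    for _ in range(n_nonzero):
        scores = np.abs(X.T @ residual)
        scores[active] = 0.0                                 # never revisit past choices
        j = int(np.argmax(scores))
        w[j] = (X[:, j] @ residual) / (X[:, j] @ X[:, j])    # locally optimal weight for column j
        residual = residual - w[j] * X[:, j]                 # weight is now frozen
        active.append(j)
    return w, active
```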

As far as I'm aware, stepwise regression has generally fallen out of favor compared to $l_1$ regularized regression (LASSO), which tends to produce better solutions.

Tibshirani (1996). Regression Shrinkage and Selection via the Lasso

LASSO penalizes the $l_1$ norm of the weights, which induces sparsity in the solution (many weights are forced to zero). This performs variable selection (the 'relevant' variables are allowed to have nonzero weights). The degree of sparsity is controlled by the penalty term, and some procedure must be used to select it (cross-validation is a common choice). LASSO is more computationally intensive than stepwise regression, but a number of efficient algorithms exist. Some examples are least angle regression (LARS), and an approach based on coordinate descent.
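For concreteness, here is a short sketch of LASSO with a cross-validated penalty using scikit-learn's `LassoCV`; the toy data (3 informative features out of 20) is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# toy problem: only 3 of 20 features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[[2, 5, 11]] = [1.5, -2.0, 0.7]
y = X @ true_w + 0.1 * rng.normal(size=200)

# LassoCV selects the l1 penalty (alpha) by cross-validation
model = LassoCV(cv=5).fit(X, y)
print("chosen penalty:", model.alpha_)
print("nonzero weights:", np.flatnonzero(model.coef_))  # sparsity = variable selection
```

With the penalty chosen this way, most coefficients are driven exactly to zero, which is the variable-selection behaviour described above.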

A similar approach to what you suggested in (2) is called orthogonal matching pursuit. It's a generalization of matching pursuit, which is the name for stepwise regression in the signal processing literature.

Pati et al. (1993). Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition

On each iteration, the next best regressor is added to the active set. Then, the weights for all regressors in the active set are recomputed. Because of the reweighting step, this approach is less greedy (and has better performance) than regular matching pursuit/stepwise regression. But it still employs a greedy search heuristic.
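A rough sketch of that reweighting step (again plain NumPy, names are my own): after each new column is added, the weights of the entire active set are re-estimated jointly by least squares, unlike the frozen weights in plain stepwise/matching pursuit.

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Orthogonal matching pursuit sketch: greedily grow the active set,
    but re-fit all active weights by least squares after each addition."""
    w = np.zeros(X.shape[1])
    residual = y.copy()
    active = []
    for _ in range(n_nonzero):
        scores = np.abs(X.T @ residual)
        scores[active] = 0.0
        active.append(int(np.argmax(scores)))
        # the reweighting step: joint least-squares fit over the whole active set
        w_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ w_active
    w[active] = w_active
    return w, active
```

scikit-learn also ships an implementation as `sklearn.linear_model.OrthogonalMatchingPursuit`, if you'd rather not roll your own.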

All of these approaches (stepwise regression, LASSO, and orthogonal matching pursuit) can be thought of as approximations of the following problem:

$$\underset{w}{\min} \| y - X w \|_2^2 \quad \text{s.t. } \|w\|_0 \le c$$

In a regression context, columns of $X$ correspond to the independent variables and $y$ to the dependent variable. In signal processing, columns of $X$ correspond to basis functions and $y$ is a signal to approximate. The goal is to find a sparse set of weights $w$ that give the best (least squares) approximation of $y$. The $l_0$ norm simply counts the number of non-zero entries in $w$. Unfortunately, this problem is NP-hard, so approximation algorithms must be used in practice. Stepwise regression and orthogonal matching pursuit attempt to solve the problem using a greedy search strategy. LASSO reformulates the problem using a relaxation of the $l_0$ norm to the $l_1$ norm. Here, the optimization problem becomes convex (and thus tractable). And, although the problem is no longer identical, the solution is similar. If I recall correctly, both LASSO and orthogonal matching pursuit have been proved to recover the exact solution under certain conditions.
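For reference, the $l_1$ relaxation that LASSO solves is usually written in the equivalent penalized (Lagrangian) form:

$$\underset{w}{\min} \| y - X w \|_2^2 + \lambda \|w\|_1$$

where larger $\lambda$ drives more weights exactly to zero; this is the penalty term selected by cross-validation mentioned above.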