I would not recommend you use that procedure. My recommendation is: Abandon this project. Just give up and walk away. You have no hope of this working.
Setting aside the standard problems with stepwise selection (cf., here), in your case you are very likely to have perfect predictions due to separation in such a high-dimensional space.
I don't have specifics on your situation, but you state that you have "only a few 10s of samples". Let's be charitable and say you have 90. You further say you have "several thousand features". Let's imagine that you 'only' have 2,000. For the sake of simplicity, let's say that all your features are binary. You "believe that the class label can be accurately predicted using only a few features", let's say that you will look for sets of up to only 9 features max. Lastly, let's imagine that the relationship is deterministic, so that the true relationship will always be perfectly present in your data. (We can change these numbers and assumptions, but that should only make the problem worse.) Now, how well would you be able to recover that relationship under these (generous) conditions? That is, how often would the correct set be the only set that yields perfect accuracy? Or, put another way, how many sets of nine features will also fit by chance alone?
Some (overly) simple math and simulations should provide some clues to this question. First, with 9 variables, each of which could be 0 or 1, the number of patterns an observation could show is $2^9 = 512$, but you will have only 90 observations. Thus it is entirely possible that, for a given set of 9 binary variables, every observation has a different set of predictor values; there are no replicates. Without replicates with the same predictor values where some have y=0 and some y=1, you will have complete separation, and perfect prediction of every observation will be possible.
Below, I have a simulation (coded in R) to see how often you might have no patterns of x-values with both 0s and 1s. The way it works is that I get a set of numbers from 1 to 512, which represent the possible patterns, and see if any of the patterns in the first 45 (which might be the 0s) match any of the patterns in the second 45 (which might be the 1s). This assumes that you have perfectly balanced response data, which gives you the best possible protection against this problem. Note that having some replicated x-vectors with differing y-values doesn't really get you out of the woods; it just means you wouldn't be able to perfectly predict every single observation in your dataset, which is the very stringent standard I'm using here.
set.seed(7938)  # this makes the simulation exactly reproducible
my.fun = function(){
  # draw 90 observations (with replacement) from the 512 possible patterns
  x = sample.int(512, size=90, replace=TRUE)
  # TRUE if no pattern among the first 45 (the 0s) appears among the second 45 (the 1s)
  return(sum(x[1:45] %in% x[46:90]) == 0)
}
n.unique = replicate(10000, my.fun())
mean(n.unique)  # [1] 0.0181
The simulation suggests you would have this issue with approximately 1.8% of the sets of 9 x-variables. Now, how many sets of 9 are there? Strictly, that would be $1991 \text{ choose } 9 = 1.3\times 10^{24}$ (since we've stipulated that the true 9 deterministic causal variables are in your set). However, many of those sets will be overlapping; there will be $1991 / 9 \approx 221$ non-overlapping sets of 9 within a specified partition of your variables (with many such partitions possible). Thus, within some given partition, we might expect there would be $221\times 0.018\approx 4$ sets of 9 x-variables that will perfectly predict every observation in your dataset.
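As a quick sanity check on those counts (a sketch using the hypothetical numbers above: 2,000 features, 9 of them causal, and the ~1.8% rate from the simulation):

```python
from math import comb

# number of candidate 9-variable sets drawn from the 1,991 non-causal features
n_sets = comb(1991, 9)
print(f"{n_sets:.2e}")   # on the order of 10^24

# non-overlapping sets of 9 within one partition of the 1,991 features
n_disjoint = 1991 // 9
print(n_disjoint)        # 221

# expected number of null sets that perfectly separate the data
print(round(n_disjoint * 0.0181, 1))  # about 4
```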
Note that these results apply only to cases where you have a comparatively large dataset (within the "tens"), a comparatively small number of variables (within the "thousands"), where we look only for sets that predict every single observation perfectly (there will be many more sets that are nearly perfect), etc. Your actual case is unlikely to work out 'this well'. Moreover, we stipulated that the relationship is perfectly deterministic. What would happen if there is some random noise in the relationship? In that case, you will still have ~4 (null) sets that perfectly predict your data, but the right set may well not be among them.
Tl;dr, the basic point here is that your set of variables is way too large / high dimensional, and your amount of data is way too small, for anything to be possible. If it's really true that you have "tens" of samples, "thousands" of variables, and absolutely no earthly idea which variables might be right, you have no hope of getting anywhere with any procedure. Go do something else with your time.
This is a rather broad question.
First, I do not think Ridge regression shrinks coefficients to 0. It does not create sparsity, so if you want to do feature selection it will be useless. You should consider the lasso instead, or the elastic net (a mix of ridge and lasso, since both an L1 and an L2 penalty are added to the minimisation problem).
If your goal is really to select variables, have a look at stability selection from Meinshausen and Bühlmann. The idea is to combine bootstrapping with Lasso regression. It exploits the fact that the Lasso has a homotopy solution (each coefficient follows a piecewise-linear solution path as the penalty varies). Starting with a very high penalty and decreasing it step by step, the coefficients become nonzero one by one. Repeating this over many bootstrap samples gives you, for each penalty value, the probability that each coefficient is nonzero (i.e., that the variable is selected).
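A minimal sketch of the idea in Python on hypothetical data (all names and numbers are illustrative; the Lasso is fit here by plain coordinate descent at a single penalty value rather than along the full homotopy path, and plain bootstrap resamples stand in for a complete stability-selection procedure):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent with soft-thresholding."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # soft-threshold: small correlations are set exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

def selection_frequencies(X, y, lam, n_boot=30, seed=0):
    """Fraction of bootstrap resamples in which each coefficient is nonzero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        counts += np.abs(lasso_cd(X[idx], y[idx], lam)) > 1e-8
    return counts / n_boot

# hypothetical data: only the first two of ten features matter
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

freq = selection_frequencies(X, y, lam=50.0)
print(np.round(freq, 2))  # true features near 1, noise features near 0
```

Variables with a selection frequency close to 1 across resamples are the stable ones; in a full implementation this is tracked across the whole penalty path.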
This would be a good method if you have a lot of variables, because the Lasso can be seen as a convex relaxation of subset selection, so it is usually much faster than exhaustive search.
Dimension reduction (PCA, for example) may not improve predictive accuracy, because it is often unsupervised (it ignores the class labels when constructing the new features). See http://metaoptimize.com/qa/questions/9338/how-to-use-pca-for-classification for a more detailed discussion of that subject.
The primary advantage of stepwise regression is that it's computationally efficient. However, its performance is generally worse than alternative methods. The problem is that it's too greedy. By making a hard selection on the next regressor and 'freezing' the weight, it makes choices that are locally optimal at each step, but suboptimal in general. And, it can't go back to revise its past choices.
As far as I'm aware, stepwise regression has generally fallen out of favor compared to $l_1$ regularized regression (LASSO), which tends to produce better solutions.
LASSO penalizes the $l_1$ norm of the weights, which induces sparsity in the solution (many weights are forced to zero). This performs variable selection (the 'relevant' variables are allowed to have nonzero weights). The degree of sparsity is controlled by the penalty term, and some procedure must be used to select it (cross-validation is a common choice). LASSO is more computationally intensive than stepwise regression, but a number of efficient algorithms exist. Some examples are least angle regression (LARS), and an approach based on coordinate descent.
A similar approach to what you suggested in (2) is called orthogonal matching pursuit. It's a generalization of matching pursuit, which is the name for stepwise regression in the signal processing literature.
On each iteration, the next best regressor is added to the active set. Then, the weights for all regressors in the active set are recomputed. Because of the reweighting step, this approach is less greedy (and has better performance) than the regular matching pursuit/stepwise regression. But, it still employs a greedy search heuristic.
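Those two steps can be sketched in a few lines of numpy (hypothetical data; variable names are illustrative, and libraries such as scikit-learn provide production implementations):

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Orthogonal matching pursuit: greedy selection plus least-squares refit."""
    n, p = X.shape
    active = []                 # indices of selected regressors
    residual = y.copy()
    for _ in range(n_nonzero):
        # greedy step: pick the regressor most correlated with the residual
        j = int(np.argmax(np.abs(X.T @ residual)))
        active.append(j)
        # reweighting step: refit ALL active weights by least squares
        w_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ w_active
    w = np.zeros(p)
    w[active] = w_active
    return w, active

# hypothetical data: two relevant features out of ten
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

w, active = omp(X, y, n_nonzero=2)
print(sorted(active))  # with a signal this strong, the true support is recovered
```

Dropping the refit (keeping each weight frozen at the value it had when selected) would turn this back into plain matching pursuit, i.e., ordinary stepwise selection.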
All of these approaches (stepwise regression, LASSO, and orthogonal matching pursuit) can be thought of as approximations of the following problem:
$$\underset{w}{\min} \| y - X w \|_2^2 \quad \text{s.t. } \|w\|_0 \le c$$
In a regression context, columns of $X$ correspond to the independent variables and $y$ to the dependent variable. In signal processing, columns of $X$ correspond to basis functions and $y$ is a signal to approximate. The goal is to find a sparse set of weights $w$ that give the best (least squares) approximation of $y$. The $l_0$ norm simply counts the number of non-zero entries in $w$. Unfortunately, this problem is NP-hard, so approximation algorithms must be used in practice. Stepwise regression and orthogonal matching pursuit attempt to solve the problem using a greedy search strategy. LASSO reformulates the problem using a relaxation of the $l_0$ norm to the $l_1$ norm. Here, the optimization problem becomes convex (and thus tractable). And, although the problem is no longer identical, the solution is similar. If I recall correctly, both LASSO and orthogonal matching pursuit have been proved to recover the exact solution under certain conditions.
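To make the connection concrete: on a tiny problem the $l_0$-constrained problem can still be solved exactly by enumerating every support of size $c$, which is exactly what the greedy and relaxed methods approximate (a toy sketch with hypothetical data; the enumeration is exponential in $p$, which is why it is infeasible at realistic sizes):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, c):
    """Exactly solve min ||y - Xw||^2 s.t. ||w||_0 <= c by brute-force enumeration."""
    n, p = X.shape
    best_err, best_support = np.inf, ()
    for support in combinations(range(p), c):
        cols = list(support)
        w_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err = np.sum((y - X[:, cols] @ w_s) ** 2)
        if err < best_err:
            best_err, best_support = err, support
    return best_support, best_err

# hypothetical data: true support is {0, 1}; p kept tiny so C(p, c) stays small
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(60)

support, err = best_subset(X, y, c=2)
print(support)  # exhaustive search examines C(8, 2) = 28 subsets here
```

With $p$ in the thousands even $c=9$ gives on the order of $10^{24}$ subsets, which is why stepwise/OMP (greedy) and LASSO (convex relaxation) are used instead.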