Why Is Stepwise Regression Sometimes Better Than LASSO? Understanding the Differences

Tags: lasso, stepwise-regression

I am trying to figure out why I keep getting better results with stepwise regression than with LASSO, even though hundreds of posts and papers state the opposite.

To explain a bit better: I have a pool of 20 variables I want to select from, plus about 150 other variables I am enforcing in the model.
(This is in an association-study context: the 20 selectable variables are genetic markers, and the rest are PCA components that control for the kinship between individuals.)

The 20 variables are quite correlated, as shown below:

[Correlation plot of the 20 candidate markers]

I am trying to get a subset of the markers that still explains the response variable in a 'good enough' manner. For that I used two methods: a forward/backward stepwise regression and a LASSO regression, roughly as sketched below.
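For concreteness, here is a simplified sketch of the two selection approaches on synthetic data (not my real pipeline: the variable names are placeholders, the enforced PCs are omitted, and I use scikit-learn's `SequentialFeatureSelector` with forward selection as a stand-in for my forward/backward procedure):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic stand-ins: 250 cases, 20 candidate markers.
rng = np.random.default_rng(0)
X_markers = rng.normal(size=(250, 20))
y = X_markers[:, :3] @ rng.normal(size=3) + rng.normal(size=250)

# Stepwise-style selection (forward only here, unlike my forward/backward run).
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select="auto", direction="forward"
).fit(X_markers, y)
stepwise_subset = np.flatnonzero(sfs.get_support())

# LASSO: the markers with non-zero coefficients form the selected subset.
lasso = LassoCV(cv=5).fit(X_markers, y)
lasso_subset = np.flatnonzero(lasso.coef_)
```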

I am a bit puzzled by the results:

$\begin{array}{r|c|c}
 & \text{Stepwise} & \text{LASSO} \\
\hline
\text{Number of variables selected} & 10 & 15 \\
\text{Correlation, fitted vs. observed values} & 0.849 & 0.846 \\
\text{MSE} & 323 & 330
\end{array}$

I was not expecting LASSO to necessarily select fewer variables than the stepwise algorithm, but I did expect its end results to be better. How can I explain this? Are the criteria I am using not appropriate, is it because I am comparing fitted values rather than predictions on additional data, or is it a totally different matter?

Note that I used the shrunken model to get the fitted values, so it is not the same problem as the one discussed here.

Best Answer

The problem here is much larger than your choice of LASSO or stepwise regression. With only 250 cases there is no way to evaluate "a pool of 20 variables I want to select from and about 150 other variables I am enforcing in the model" (emphasis added) unless you do some type of penalization. You are almost certainly overfitting severely with the 150 enforced variables, as the extremely high fitted-vs.-observed correlations suggest (at least based on my decades of experience in biologic research). Your entire model should probably include only on the order of 20 effective predictors, since a common rule of thumb allows roughly one unpenalized predictor per 10 to 20 cases: either 20 unpenalized predictors, or a larger number that are penalized. If you insist on keeping all 150 enforced predictors in the model, then you should use ridge regression to penalize them.
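If it helps, here is a minimal Python sketch of that kind of penalization, using scikit-learn's `RidgeCV` on placeholder synthetic data (note this penalizes all predictors uniformly; for a differential penalty on only the enforced block you would need something like `glmnet`'s `penalty.factor` in R):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 250 cases, 150 enforced PCs plus 20 candidate markers.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 170))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=250)

# Standardize so the ridge penalty treats all predictors comparably,
# then choose the penalty strength by cross-validation over a wide grid.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 25)),
)
model.fit(X, y)
```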

The difficulty of drawing reliable conclusions from small data sets relative to the number of predictor variables is precisely why it's important to evaluate model-building approaches with tools like multiple bootstraps. Your present sample is the best estimate you have of the underlying population. Taking multiple samples of the same size from the data with replacement and repeating the entire model building process on each of hundreds of resamples is a useful way to estimate whether you would have similar success with other data samples of the same size. Any rare relationships among individuals are those most likely to be missed in a new sample from the population and thus likely to pose problems with generalization. The advantage of LASSO over stepwise selection is seen in these tests of whether the modeling process generalizes well to new samples.