How does LASSO select among collinear predictors?

feature-selection, lasso

I'm looking for an intuitive answer to why a GLM LASSO model selects a specific predictor out of a group of highly correlated ones, and why it does so differently than best-subset feature selection.

From the geometry of the LASSO shown in Fig. 2 of Tibshirani (1996), I'm led to believe that LASSO selects the predictor with the greater variance.

Now suppose that I use best-subset selection with 10-fold CV to obtain 2 predictors for a logistic regression model, and I have reasonable prior knowledge that these 2 predictors are optimal (in the 0-1 loss sense).

The LASSO solution favors a less parsimonious model (5 predictors) with greater prediction error. Intuitively, what causes this difference? Is it because of the way LASSO selects among correlated predictors?

Best Answer

LASSO differs from best-subset selection in terms of penalization and path dependence.

In best-subset selection, presumably CV was used to identify that 2 predictors gave the best performance. During CV, full-magnitude regression coefficients without penalization would have been used for evaluating how many variables to include. Once the decision was made to use 2 predictors, then all combinations of 2 predictors would be compared on the full data set, in parallel, to find the 2 for the final model. Those 2 final predictors would be given their full-magnitude regression coefficients, without penalization, as if they had been the only choices all along.
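As a rough, simplified sketch of that best-subset flavour (synthetic data, scikit-learn assumed; this is not the exact pipeline from the question): score every subset of each size by 10-fold CV with essentially unpenalized logistic fits, pick the best size, then refit the winning predictors on the full data so they keep their full-magnitude coefficients.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + rng.normal(size=n) > 0).astype(int)  # only 2 "true" predictors

def unpenalized_logit():
    # Very large C makes the default L2 penalty negligible, i.e. effectively unpenalized
    return LogisticRegression(C=1e6, max_iter=1000)

def best_subset(k):
    """Exhaustively score all k-variable subsets by 10-fold CV accuracy."""
    scored = [(cross_val_score(unpenalized_logit(), X[:, list(cols)], y, cv=10).mean(), cols)
              for cols in combinations(range(p), k)]
    return max(scored)  # (best CV accuracy, best column tuple)

results = {k: best_subset(k) for k in range(1, 4)}
k_best = max(results, key=lambda k: results[k][0])
final_cols = list(results[k_best][1])

# Final model: full-magnitude (unpenalized) coefficients for the chosen predictors only
final_model = unpenalized_logit().fit(X[:, final_cols], y)
print("chosen size:", k_best, "columns:", final_cols, "coefs:", np.round(final_model.coef_, 2))
```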

You can think of LASSO as starting with a large penalty on the sum of the magnitudes of the regression coefficients, with the penalty gradually relaxed. The result is that variables enter one at a time, with a decision made at each point during the relaxation whether it's more valuable to increase the coefficients of the variables already in the model, or to add another variable. But when you get, say, to a 2-variable model, the regression coefficients allowed by LASSO will be lower in magnitude than those same variables would have in the standard non-penalized regressions used to compare 2-variable and 3-variable models in best-subset selection.
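Here is a small sketch of that point on made-up data (using the linear-regression LASSO for simplicity; the same picture holds for the logistic version): walk the LASSO path from heavy to light penalty, stop at the first 2-variable model, and compare those shrunken coefficients with an unpenalized fit of the same 2 variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, lasso_path

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# lasso_path returns alphas from the heaviest penalty (all coefficients zero) downward
alphas, coefs, _ = lasso_path(X, y)

for alpha, beta in zip(alphas, coefs.T):
    active = np.flatnonzero(beta)
    if len(active) == 2:  # first point on the path with exactly 2 active variables
        ols = LinearRegression().fit(X[:, active], y)
        print("active columns:    ", active)
        print("LASSO coefficients:", np.round(beta[active], 2))
        print("unpenalized (OLS): ", np.round(ols.coef_, 2))
        break
```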

This can be thought of as making it easier for new variables to enter in LASSO than in best-subset selection. Heuristically, LASSO trades off potentially lower-than-actual regression coefficients against the uncertainty in how many variables should be included. This tends to bring more variables into a LASSO model, and can lead to worse performance for LASSO if you knew for sure that only 2 variables needed to be included. But if you already knew how many predictor variables should be included in the correct model, you probably wouldn't be using LASSO.

Nothing so far has depended on collinearity, which leads to different types of arbitrariness in variable selection under best-subset versus LASSO. In this example, best-subset examined all possible combinations of 2 predictors and chose the best among those combinations. So the best 2 for that particular data sample win.

LASSO's path dependence, adding one variable at a time, means that an early choice of one variable may influence when other variables correlated with it enter later in the relaxation process. It's also possible for a variable to enter early and then for its LASSO coefficient to drop as other correlated variables enter.
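Another hedged sketch, this time with a nearly collinear pair, to watch that path dependence: print which variables are active along the path. Depending on the random draw, one of the correlated pair may enter early and carry most of the weight, then shrink or share it once its partner enters. Everything here (data, seed, number of path points) is illustrative.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)   # x1 and x2 are almost the same variable
x2 = z + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)              # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.5 * x3 + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=15)
for alpha, beta in zip(alphas, coefs.T):
    print(f"alpha={alpha:6.3f}  beta1={beta[0]: .2f}  beta2={beta[1]: .2f}  beta3={beta[2]: .2f}")
```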

In practice, the choice among correlated predictors in final models with either method is highly sample dependent, as can be checked by repeating these model-building processes on bootstrap samples of the same data. If there aren't too many predictors, and your primary interest is in prediction on new data sets, ridge regression, which tends to keep all predictors, may be a better choice.
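A rough sketch of that bootstrap check (scikit-learn, synthetic data): refit a cross-validated LASSO on bootstrap resamples and tally how often each predictor is selected. With a nearly collinear pair, the "winner" will often flip between resamples.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n = 300
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)   # highly correlated pair
x2 = z + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.5 * x3 + rng.normal(size=n)

n_boot = 50
selected = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)              # bootstrap resample of rows
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    selected += np.abs(fit.coef_) > 1e-8          # count nonzero coefficients

print("selection frequency per predictor:", selected / n_boot)
```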
