Solved – Why lasso for feature selection

feature selection, lasso, linear model, ridge regression

Suppose I have a high-dimensional dataset and want to perform feature selection. One way is to train a model capable of ranking feature importance and then use it to throw away the least important features.

In practice I would use sklearn's SelectFromModel transformer for this. According to the documentation, any estimator with either a feature_importances_ or a coef_ attribute will do.
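For example, something along these lines (a minimal sketch with made-up data and an arbitrary alpha, just to show the API):

```python
# Sketch only: random data and an arbitrary penalty, purely to illustrate
# how SelectFromModel thresholds on a fitted Lasso's coef_.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 20)                      # hypothetical high-dimensional data
y = X[:, 0] - 2 * X[:, 1] + np.random.randn(100)

# Lasso exposes coef_, so SelectFromModel keeps the features it assigns weight to
selector = SelectFromModel(Lasso(alpha=0.1))
X_reduced = selector.fit_transform(StandardScaler().fit_transform(X), y)

print(selector.get_support())                     # boolean mask of retained features
print(X_reduced.shape)
```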

Besides Lasso, many other linear models expose a coef_ attribute (LinearRegression, Ridge, and ElasticNet, to name a few) and could be used to identify the most important features.

What makes Lasso the most popular model for identifying the most important features in a dataset?

Best Answer

First, be careful in specifying what you mean by "the most important features" in a dataset. See this page for different perspectives on this issue. For example, features that are deemed "unimportant" individually might be needed to help improve predictions based on other features, so you might not want to throw them away.

What LASSO does well is provide a principled way to reduce the number of features in a model. In contrast, automated feature selection based on standard linear regression, by stepwise selection or by choosing features with the lowest p-values, has many drawbacks. Advantages of LASSO over other regression-based approaches are specifically described here. LASSO involves a penalty factor that determines how many features are retained; using cross-validation to choose the penalty factor helps ensure that the model will generalize well to future data samples.
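As a concrete illustration (hypothetical data and default settings, not from any linked page), LassoCV in sklearn picks the penalty by cross-validation, and the retained features are simply those left with nonzero coefficients:

```python
# Sketch only: LassoCV chooses the penalty (alpha) by cross-validation;
# the features it "selects" are those whose coefficients remain nonzero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X = np.random.randn(200, 50)
y = X[:, 0] + 0.5 * X[:, 1] + np.random.randn(200)

lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(X), y)

print("chosen penalty (alpha):", lasso.alpha_)
print("retained feature indices:", np.flatnonzero(lasso.coef_))
```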

Ridge regression does not attempt to select features at all; instead, it applies a penalty to the sum of the squares of all regression coefficients. Again, choosing the penalty by cross-validation helps ensure generalization. Elastic net can be thought of as a hybrid of LASSO and ridge. See this page for details on the differences among these penalized methods. If your main interest is in prediction and it's not too expensive to gather information about all the features, you might not need to do feature selection at all and could instead use ridge regression to keep information about all the predictors in the model.
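For reference, these are the usual penalized least-squares objectives (written from the standard textbook forms, not taken from the linked page):

$$\hat\beta_{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|, \qquad \hat\beta_{\text{ridge}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \sum_j \beta_j^2,$$

$$\hat\beta_{\text{elastic net}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2.$$

The absolute-value penalty is what drives some coefficients exactly to zero; the squared penalty only shrinks them toward zero.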

If you need to cut down on the number of predictors for practical reasons, LASSO is a good choice. But all it does is give you a useful set of selected predictors, not necessarily the most important ones in some general sense. When features are correlated, LASSO will choose one of them based on its performance in the particular data sample at hand. With a different sample it could well choose a different feature from the correlated set. This doesn't typically affect the predictive performance of the LASSO model, but it should give pause about what is meant by "the most important features." See this page for discussion of such instability in LASSO modeling.
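A rough sketch of that behavior (entirely illustrative data, not from the linked discussion): refit LASSO on bootstrap resamples of the same data containing two highly correlated features and watch which one survives.

```python
# Sketch only: two nearly collinear features; LASSO may keep different ones
# (or different subsets) across bootstrap resamples of the same data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n),   # feature 0, nearly equal to feature 1
                     z + 0.05 * rng.normal(size=n),   # feature 1, nearly equal to feature 0
                     rng.normal(size=(n, 3))])        # irrelevant noise features
y = z + rng.normal(size=n)

for i in range(5):
    idx = rng.integers(0, n, size=n)                  # bootstrap resample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    print("resample", i, "nonzero coefficients at:", np.flatnonzero(coef))
```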
