Solved – Tuning the alpha parameter in a LASSO linear model in scikit-learn

Tags: lasso, regression, scikit-learn

I'm using the LASSO method for text classification (sentiment classification). My features are mainly n-grams (sequences of N consecutive words), and I'm using the LASSO specifically so that I can rank the features and extract the set of n-grams that are significant for the classification problem.
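For context, here is a minimal sketch of that setup, assuming scikit-learn's CountVectorizer for the n-gram features; the toy corpus, labels, and alpha value are illustrative stand-ins for the real data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

# Toy corpus and binary sentiment labels standing in for the real data
docs = ["great movie, loved it",
        "terrible plot, boring acting",
        "loved the great acting",
        "boring and terrible"]
labels = [1, 0, 1, 0]

# Unigrams and bigrams ("every N consecutive words", here N = 1, 2)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# LASSO drives the coefficients of uninformative n-grams to exactly zero
lasso = Lasso(alpha=0.1).fit(X, labels)

# Rank the surviving n-grams by coefficient magnitude
names = vectorizer.get_feature_names_out()
nonzero = np.flatnonzero(lasso.coef_)
ranked = sorted(zip(np.abs(lasso.coef_[nonzero]), names[nonzero]),
                reverse=True)
print(ranked)
```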

My question is about tuning the alpha parameter in the scikit-learn model: I understand that as I increase alpha, fewer features end up with nonzero coefficients. So I wanted to ask:

  1. What is the best practice for selecting the number of important features, and hence the alpha value? Cross-validation is an option if I seek maximum score rather than model interpretation, but is there something that measures the "minimum adequate number of features for the classification process"?
  2. If I decided to keep only the top 1000 features and tuned alpha so that exactly 1000 features have nonzero coefficients, would the LASSO here differ from fitting an ordinary linear regression and ranking its top 1000 features?

Best Answer

First: trying to set alpha to find a pre-specified number of important features isn't a good idea. Whether a feature is predictive of the response is a property of the data, not your model. So you want your model to tell you how many features are important, not the other way around. If you try to mess with your alpha until it finds a pre-specified number of features to be predictive, you run the risk of overfitting (if there are really fewer predictive features than that) or underfitting (if there are more).

This is why the tuning parameter is often selected automatically by minimizing cross-validated generalization error. In the cross-validation setting, people frequently do something similar to finding the "minimum adequate number of features", which is to select the largest alpha whose cross-validated error is at most one standard error above the minimum cross-validated error (e.g. here, p. 18). The rationale for this is that the cross-validated error estimate is noisy, and if you select the alpha that simply minimizes the estimate you risk overfitting to that noise, so it's better to "err on the side of parsimony," as the paper puts it.
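A minimal sketch of this one-standard-error rule using scikit-learn's LassoCV; the synthetic data stands in for the n-gram matrix, and the fold count is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the document-term matrix
X, y = make_regression(n_samples=200, n_features=500,
                       n_informative=20, noise=1.0, random_state=0)

cv_model = LassoCV(cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): per-fold error along the path
mean_mse = cv_model.mse_path_.mean(axis=1)
se_mse = cv_model.mse_path_.std(axis=1) / np.sqrt(cv_model.mse_path_.shape[1])

best = np.argmin(mean_mse)                 # alpha minimizing CV error
threshold = mean_mse[best] + se_mse[best]  # one standard error above it

# alphas_ is in decreasing order, so the first alpha within the
# threshold is the largest (most parsimonious) one
alpha_1se = cv_model.alphas_[np.argmax(mean_mse <= threshold)]
print(cv_model.alpha_, alpha_1se)
```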

On the other hand, some papers (e.g. here) have noted that selecting alpha by minimizing cross-validated error does not yield consistent feature selection in practice (i.e., it does not select features if and only if they are truly relevant). An alternative is selecting alpha based on the BIC, as advocated by e.g. Zou, Hastie and Tibshirani here. (For the BIC, one should set the "degrees of freedom" equal to the rank of the feature matrix restricted to the features with nonzero coefficients; see the paper for more detail.)
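As one possible implementation, scikit-learn provides LassoLarsIC, which selects alpha along the LARS path by AIC or BIC; whether its degrees-of-freedom estimate matches the rank-based correction described above in every detail is worth checking against the paper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Synthetic data; n_samples > n_features so the noise variance
# needed by the criterion can be estimated internally
X, y = make_regression(n_samples=200, n_features=100,
                       n_informative=20, noise=1.0, random_state=0)

# criterion='bic' picks the alpha minimizing BIC along the LARS path
bic_model = LassoLarsIC(criterion="bic").fit(X, y)
print("alpha chosen by BIC:", bic_model.alpha_)
print("nonzero features:", int(np.sum(bic_model.coef_ != 0)))
```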
