I'm using the LASSO method for a text classification problem (sentiment classification). The features are mainly n-grams (every N consecutive words), and I'm using the LASSO specifically so that I can rank the features and extract the set of n-grams that are significant for the classification problem.
My question is about tuning the alpha parameter in the scikit-learn model: I understand that as I increase alpha, fewer features are selected. So I wanted to ask:
- What is the best practice for selecting the number of important features, and hence the alpha value? Cross-validation is an option if I seek maximum score rather than model interpretation, but is there something that measures the "minimum adequate number of features" for the classification process?
- Suppose I decide to keep only the top 1000 features and set alpha so that exactly 1000 coefficients are nonzero. Would the LASSO here differ from fitting an ordinary linear regression and ranking the top 1000 features?
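As background to the question, the effect of alpha on sparsity is easy to observe directly. A minimal sketch (synthetic data standing in for an n-gram matrix; the sizes and coefficients are illustrative assumptions, not from the original problem):

```python
# Larger alpha values in scikit-learn's Lasso zero out more coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))          # 200 "documents", 50 "n-gram" features
y = X[:, :5] @ np.full(5, 2.0) + 0.1 * rng.standard_normal(200)  # only 5 matter

counts = []
for alpha in (0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha}: {counts[-1]} nonzero coefficients")
```

The nonzero count shrinks as alpha grows, which is the behaviour the question describes.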
Best Answer
First: trying to set alpha to find a pre-specified number of important features isn't a good idea. Whether a feature is predictive of the response is a property of the data, not of your model, so you want your model to tell you how many features are important, not the other way around. If you tune alpha until it finds a pre-specified number of features to be predictive, you run the risk of overfitting (if there are really fewer predictive features than that) or underfitting (if there are more).

This is why the tuning parameter is often selected automatically by minimizing cross-validated generalization error. In the cross-validation setting, people frequently do something similar to finding the "minimum adequate number of features": they select the largest alpha whose cross-validated error is at most one standard error above that of the alpha with the lowest cross-validated error (e.g. here, p. 18). The rationale is that there is some noise in the cross-validated error estimate, and if you select the alpha that simply minimizes the estimate you risk overfitting to that noise, so it's better to "err on the side of parsimony," as the paper puts it.

On the other hand, some papers (e.g. here) have noted that selecting alpha by minimizing cross-validated error does not yield consistent feature selection in practice (i.e., features are not selected if and only if they should be). An alternative is selecting alpha based on the BIC, as advocated by e.g. Zou, Hastie and Tibshirani here. (For the BIC, one should set the "degrees of freedom" equal to the rank of the feature matrix restricted to the features found to be nonzero; see the paper for more detail.)
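Both selection strategies are available in scikit-learn. A sketch on synthetic data (the data and sizes are assumptions for illustration): the one-standard-error rule is a few lines of post-processing on top of LassoCV, and BIC-based selection is built into LassoLarsIC.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsIC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
y = X[:, :4] @ np.array([2.0, -1.5, 1.0, 0.5]) + rng.standard_normal(200)

# 1) Cross-validation: LassoCV stores the per-fold MSE for every candidate
#    alpha, so the one-standard-error rule can be applied afterwards.
lcv = LassoCV(cv=5).fit(X, y)
mean_mse = lcv.mse_path_.mean(axis=1)
se = lcv.mse_path_.std(axis=1) / np.sqrt(lcv.mse_path_.shape[1])
best = mean_mse.argmin()
# largest alpha whose CV error is within one SE of the minimum
alpha_1se = lcv.alphas_[mean_mse <= mean_mse[best] + se[best]].max()
print(f"alpha minimizing CV error: {lcv.alpha_:.4f}")
print(f"alpha from one-SE rule:    {alpha_1se:.4f}")

# 2) Information criterion: LassoLarsIC picks the alpha minimizing the BIC.
bic = LassoLarsIC(criterion="bic").fit(X, y)
print(f"alpha minimizing BIC:      {bic.alpha_:.4f}")
```

By construction the one-SE alpha is at least as large as the CV-minimizing alpha, i.e. it yields a model that is at least as sparse.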