Solved – how to use elastic net to select a set of features

elastic-net, feature-selection, lasso

I have a dataset with 500 samples and 100 features, and I need to come up with a set of features. Management prefers a model with a smaller set of features. How exactly should I use elastic net to do this?

I understand elastic net is an 'embedded method' for feature selection: it uses a combination of L1 and L2 penalties to shrink the coefficients of 'unimportant' features to zero or near zero.
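
For reference, the penalized objective I have in mind is roughly the following (with $\lambda$ the overall penalty strength and $\alpha$ the L1/L2 mix; as far as I can tell this matches scikit-learn's parametrization):

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda\left(\alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2\right)$$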

Should I run the model, look at the coefficient of each variable, and somewhat 'arbitrarily' select the top n features based on the absolute values of the coefficients? After the set of features is selected, I guess I need to run the model again with only the selected features. Is this method correct? Is that typically how people do feature selection with elastic net? Concretely, I mean something like the sketch below.
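
This is a sketch of the procedure I have in mind (scikit-learn, with a synthetic stand-in for my 500 x 100 dataset); the cutoff `n_keep` is the arbitrary part I'm unsure about:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my 500 x 100 dataset.
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)

# Elastic net with a cross-validated penalty strength.
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X_std, y)

# 'Arbitrarily' keep the top n features by absolute coefficient...
n_keep = 20
top_idx = np.argsort(np.abs(enet.coef_))[::-1][:n_keep]

# ...then refit using only the selected features.
refit = LinearRegression().fit(X_std[:, top_idx], y)
```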

An additional question: lasso can shrink the coefficients of a large number of features to exactly zero. Can someone comment on feature selection with lasso vs. elastic net?

Thanks a lot for any advice.

Best Answer

Feature importance is a complex question and cannot be solved using implicit selection alone. You have a fairly small number of observations relative to the number of features, which makes the procedure a bit more difficult because the risk of overfitting is high.

A simple way to avoid some degree of overfitting is to use a model-agnostic method to determine feature importance. A good example is the mean decrease in accuracy (MDA), or, for regression, the mean increase in squared error: the more the accuracy decreases (or the MSE increases) when a feature is disturbed, the more important that feature is. A simple way to implement this is to randomly permute a feature's values (so that it carries little or no signal) and see how the model's performance changes.
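
Here is a minimal sketch of that permutation approach using scikit-learn's `permutation_importance`, with a synthetic stand-in for your data (the split size and number of repeats are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 500 x 100 problem; swap in your own X, y.
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = ElasticNetCV(cv=5).fit(X_train, y_train)

# Permute each feature several times and record the drop in R^2
# (use scoring='neg_mean_squared_error' if you prefer to track MSE).
# Importance is computed within the training data here, keeping X_test
# untouched for a final check.
result = permutation_importance(model, X_train, y_train,
                                n_repeats=20, random_state=0)

# Features whose permutation barely hurts the score are candidates to drop.
ranking = np.argsort(result.importances_mean)[::-1]
```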

Feature selection should be done on the same training data as the rest of your hyperparameter tuning (for elastic net, the parameters that govern the type and amount of regularization, i.e. the L1/L2 mix and the penalty strength). This ensures you (somewhat) prevent overfitting. Ideally this lets you eliminate some features via MDA without compromising (or even while improving) your score, and elastic net's embedded feature selection will then remove even more.
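
Continuing the sketch above, `ElasticNetCV` can tune both the L1/L2 mix (`l1_ratio`) and the penalty strength (`alpha`) by cross-validation on the training data only; the grid below is just an assumption:

```python
# Tune the L1/L2 mix and the penalty strength by CV on the training data,
# then see how many coefficients the embedded selection drives to zero.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                    n_alphas=100, cv=5).fit(X_train, y_train)

kept = np.flatnonzero(enet.coef_)
print(f"l1_ratio={enet.l1_ratio_}, alpha={enet.alpha_:.4f}: "
      f"{kept.size} of {X_train.shape[1]} features kept")
```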

It may be the case, however, that your best model eliminates no features. If so, you will have to trade some score for interpretability. Note that I've left out ideas based on projecting into a spectral space, since I get the impression you want to know what these variables are in their original basis.
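
If that happens, one way (among others) to make the trade-off explicit is to walk up the regularization path and accept the sparsest model whose cross-validated score stays within some tolerance of the best; the 5% tolerance below is an arbitrary choice, and the snippet continues from the code above:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Scan increasing penalty strengths at the tuned l1_ratio.
results = []
for a in np.logspace(-3, 1, 30):
    model = ElasticNet(alpha=a, l1_ratio=enet.l1_ratio_, max_iter=10_000)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()  # CV R^2
    n_kept = np.count_nonzero(model.fit(X_train, y_train).coef_)
    results.append((a, score, n_kept))

# Sparsest model within 5% of the best cross-validated score.
best = max(score for _, score, _ in results)
candidates = [r for r in results if r[1] >= best - 0.05 * abs(best)]
alpha_sparse, score_sparse, n_sparse = min(candidates, key=lambda r: r[2])
print(f"alpha={alpha_sparse:.4f}: {n_sparse} features, "
      f"CV R^2={score_sparse:.3f} (best {best:.3f})")
```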