Elastic/ridge/lasso analysis, what then?

Tags: elastic net, hypothesis testing, inference, lasso, prediction

I'm getting really interested in the elastic net procedure for predictor shrinkage/selection. It seems very powerful.

But from a scientific point of view, I don't really know what to do once I have the coefficients. What question am I answering? Is it "these are the variables that most influence the outcome, and these are the coefficients that gave the best bias/variance trade-off during validation"?

This is of course a very descriptive/predictive approach compared to the classical p-value/confidence-interval approach. Inferential estimation for these methods is now being studied by Tibshirani & co., but it is still experimental.

Some people use the variables chosen by the elastic net to perform classical inferential analysis afterwards, but that would throw away the reduction in variance brought by the technique (and the selection step invalidates the usual p-values).

Another problem is that, since the lambda and alpha parameters of the elastic net are chosen by cross-validation, they are subject to random variability. So every time you run, e.g., cv.glmnet(), you will select a slightly different subset of predictors, with different coefficients each time.

I thought about addressing this by treating the right lambda and alpha as random variables and re-running the cross-validation step n times to get a distribution of these parameters. This way, for every predictor I would have a selection frequency, and for every coefficient I would have a distribution of estimates. This should give me more generalizable results, with spread statistics (like the SD of the coefficient estimates); a sketch of this idea follows below. It would also be interesting to see whether the lambda and alpha picked this way converge to some distribution asymptotically, since that would open the way to some inference test (but I'm not a statistician, so I shouldn't speak about things I don't fully understand).
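
Something like this rough sketch is what I mean (simulated data, glmnet assumed, and alpha held fixed for simplicity; this is only an illustration of the idea, not a validated procedure):

```r
# Sketch of the idea: repeat cross-validation many times and record
# which predictors survive and how their coefficients vary.
# Simulated data; in practice x and y would be your own design/response.
library(glmnet)

set.seed(1)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

n_reps   <- 100
coef_mat <- matrix(NA, n_reps, p)

for (i in seq_len(n_reps)) {
  cv <- cv.glmnet(x, y, alpha = 0.5)               # alpha fixed for simplicity
  coef_mat[i, ] <- as.numeric(coef(cv, s = "lambda.min"))[-1]  # drop intercept
}

sel_freq <- colMeans(coef_mat != 0)   # how often each predictor is selected
coef_sd  <- apply(coef_mat, 2, sd)    # spread of each coefficient estimate
```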

So, finally, my question is: once you get the predictors and the coefficients from an elastic net with cross-validation-based alpha and lambda, how should you present these results, and how should you discuss them? What did we learn? Which hypothesis/generalization are we confirming or refuting?

Best Answer

These methods--the lasso and elastic net--were born out of the problems of both feature selection and prediction. It's through these two lenses that I think an explanation can be found.

Matthew Gunn nicely explains in his reply that these two goals are distinct and often taken up by different people. However, fortunately for us, the methods we're interested in can perform well in both arenas.

Feature Selection

First, let's talk about feature selection. We should motivate the elastic net from the perspective of the lasso. That is, to quote Zou and Hastie, "If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected." This is a problem, for instance, because it means that we're not likely to find an element of the true support using the lasso, just one highly correlated with it. (The paper mentions that this is proven in the LARS paper, which I haven't read yet.) The difficulty of support recovery in the presence of correlation is also pointed out by Wainwright, who showed (in Theorem 2a) that the probability of support recovery is bounded above by $0.5$ when there is high correlation between the true support and its complement.

Now, the $\ell_2$ penalty in the elastic net encourages features whose coefficients are treated as indistinguishable by the loss and $\ell_1$ penalty alone to receive equal estimated coefficients. We can loosely see this by noticing that $(a,b) = \arg\min_{a',b' \,:\, |a'| + |b'| = c} \, (a')^2 + (b')^2$ satisfies $|a| = |b|$. Because of this, the elastic net makes us less likely to 'accidentally' zero out a coefficient estimate that is in the true support. That is, the true support is more likely to be contained within the estimated support. That's good! It does mean there are more false discoveries, but that's a price most people are willing to pay.
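
To make this concrete, here is a small illustrative sketch (my own simulated example, assuming the glmnet package): with two nearly identical predictors, the lasso tends to keep one and drop the other, while the elastic net tends to give both similar coefficients.

```r
# Two highly correlated copies of the same signal, plus noise features.
library(glmnet)

set.seed(2)
n <- 500
z <- rnorm(n)
x <- cbind(z + rnorm(n, sd = 0.01),         # copy 1 of the signal
           z + rnorm(n, sd = 0.01),         # copy 2 of the signal
           matrix(rnorm(n * 8), n, 8))      # irrelevant features
y <- z + rnorm(n)

lasso <- cv.glmnet(x, y, alpha = 1)         # pure l1 penalty
enet  <- cv.glmnet(x, y, alpha = 0.3)       # mixed l1/l2 penalty

# The first two entries (after the intercept) are the correlated pair:
as.numeric(coef(lasso, s = "lambda.min"))[2:3]  # often one is (near) zero
as.numeric(coef(enet,  s = "lambda.min"))[2:3]  # typically both similar
```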

As an aside, it's worth pointing out that, because highly correlated features tend to receive very similar coefficient estimates, we can detect groupings of features within the estimated support that influence the response similarly.

Prediction

Now we move on to prediction. As Matthew Gunn points out, choosing tuning parameters through cross-validation amounts to choosing the model with minimal estimated prediction error. Since any model selected by the lasso can also be selected by the elastic net (by taking $\alpha = 1$), it makes some sense that the elastic net is able to find a model that predicts at least as well as the lasso.
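
For instance, one common recipe for letting cross-validation pick both tuning parameters (a hedged sketch, not something glmnet prescribes) is to run cv.glmnet over a grid of $\alpha$ values with shared fold assignments, so the CV errors are comparable across the grid:

```r
# Cross-validate over a small alpha grid as well as lambda; shared folds
# make the CV errors comparable across alpha values. Simulated data.
library(glmnet)

set.seed(3)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

alphas <- c(0, 0.25, 0.5, 0.75, 1)             # alpha = 1 recovers the lasso
foldid <- sample(rep(1:10, length.out = n))

fits <- lapply(alphas, function(a)
  cv.glmnet(x, y, alpha = a, foldid = foldid))

cv_err <- sapply(fits, function(f) min(f$cvm))  # best CV error per alpha
best   <- fits[[which.min(cv_err)]]             # overall winner
```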

Lederer, Yu, and Gaynanova show, under no assumptions whatsoever on the features, that the lasso and the elastic net can both have their $\ell_2$ prediction error bounded by the same quantity. It's not necessarily true that their bound is tight, but this may be interesting to note, since oracle inequalities seem to be a standard way in the statistical literature to quantify the predictive performance of estimators, perhaps because the estimators' distributions are so complicated! It's also worth noting that Lederer (1)(2) has some papers on lasso prediction in the presence of correlated features.

Summary

In summary, the problems of interest are having the true support contained within the estimated support, and prediction. For support recovery, there are rigorously proven guarantees (due to Wainwright) that the lasso selects the correct features under an assumption of low correlation between the true support and its complement. However, in the presence of correlation, we can fall back on the elastic net, which is more likely to include the features of the true support among all those it selects. (Note that we have to select the tuning parameters carefully here.) And for prediction, when we choose the tuning parameters through cross-validation, it makes intuitive sense that the elastic net should perform at least as well as the lasso, especially in the presence of correlation.

Putting aside prediction and some formality, what did we learn? We learned about the true support.

Confidence Intervals

It's worth pointing out that a lot has changed in the past two years with regard to valid inference for the lasso. In particular, the work of Lee, Sun, Sun, and Taylor provides exact inference for the coefficients of the lasso, conditional on the given model being selected. (Results on inference for the true coefficients in the lasso existed at the time of the OP's post, and they are well summarized in the linked paper.)
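
As a hedged sketch of how this conditional inference can be run in practice, the selectiveInference R package (by Tibshirani and coauthors) implements the Lee et al. procedure; note the lambda rescaling, since glmnet divides the squared-error loss by $n$ while fixedLassoInf expects the unscaled objective:

```r
# Post-selection inference for the lasso, conditional on the selected model.
# glmnet solves (1/(2n))||y - x b||^2 + lambda ||b||_1, so pass lambda * n.
library(glmnet)
library(selectiveInference)

set.seed(4)
n <- 100; p <- 10
x <- scale(matrix(rnorm(n * p), n, p))
y <- x[, 1] + rnorm(n)

lam  <- 0.1
fit  <- glmnet(x, y, standardize = FALSE)
beta <- as.numeric(coef(fit, x = x, y = y, s = lam, exact = TRUE))[-1]

# p-values and confidence intervals conditional on the selection event
out <- fixedLassoInf(x, y, beta, lambda = lam * n)
out
```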
