If the LASSO is equivalent to linear regression with a Laplace prior, how can there be mass on sets with components at zero?

laplace-distribution, lasso

We are all familiar with the notion, well documented in the literature, that LASSO optimization (for simplicity, I confine attention here to the case of linear regression)
$$
{\rm loss} = \| y - X \beta \|_2^2 + \lambda \| \beta \|_1
$$
is equivalent to the linear model with Gaussian errors in which the parameters are given the Laplace prior
$$
\exp(-\lambda \| \beta \|_1 )
$$
We are also aware that the higher one sets the tuning parameter $\lambda$, the larger the share of parameters that are set to zero. That said, I have the following thought question:

Consider that from the Bayesian point of view we can calculate the posterior probability that, say, the non-zero parameter estimates lie in any given collection of intervals and the parameters set to zero by the LASSO are equal to zero. What has me confused is, given that the Laplace prior is continuous (in fact absolutely continuous) then how can there be any mass on any set that is a product of intervals and singletons at $\{0\}$?

Best Answer

As the comments above point out, the Bayesian interpretation of the LASSO does not take the expected value of the posterior distribution, which is what you would do if you were a purist. If it did, you would be right: the posterior is continuous, so it puts zero probability on any parameter being exactly zero, and the posterior mean would essentially never be exactly zero given the data.

In reality, the Bayesian interpretation of the LASSO takes the MAP (maximum a posteriori) estimator of the posterior. It sounds like you are familiar with it, but for anyone who is not, this is essentially the Bayesian analogue of maximum likelihood: you use the value at which the posterior density is highest (the mode) as your estimate of the parameters. Since the Laplace prior density rises exponentially towards zero from the negative side and falls off exponentially on the positive side, it has a sharp peak at zero. So unless your data strongly suggest that a coefficient takes some other value, the maximum of your posterior is likely to be at exactly 0.
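
To make the MAP connection concrete, here is a short sketch of the derivation. It assumes Gaussian errors with variance $\sigma^2$ and a Laplace prior proportional to $\exp(-\tau \| \beta \|_1)$; the symbols $\sigma^2$ and $\tau$ are my own notation, not the question's. Up to an additive constant, the negative log posterior is
$$
-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \| y - X \beta \|_2^2 + \tau \| \beta \|_1 + {\rm const}
$$
Multiplying by $2\sigma^2$ gives the LASSO loss with $\lambda = 2\sigma^2 \tau$, so minimising the loss is the same as maximising the posterior density. Under these assumptions, taking $\sigma^2 = 1/2$ makes the prior exactly $\exp(-\lambda \| \beta \|_1)$, as written in the question.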

Long story short, your intuition seems to be based on the mean of the posterior, but the Bayesian interpretation of LASSO is based on taking the mode of the posterior.
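
The difference between mode and mean can be checked numerically. The following is a minimal sketch, not a full LASSO fit: it assumes a single coefficient observed once with noise variance 1/2, so that the posterior density is proportional to `exp(-(z - b)**2 - lam*|b|)`. The function names `lasso_map` and `posterior_mean` and the values of `z` and `lam` are hypothetical, chosen only for illustration.

```python
import numpy as np

def lasso_map(z, lam):
    """Minimiser of (z - b)**2 + lam*|b| over b (soft-thresholding at lam/2)."""
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0)

def posterior_mean(z, lam, half_width=20.0, n=400_001):
    """Mean of the density proportional to exp(-(z - b)**2 - lam*|b|).

    Approximated on an evenly spaced grid; half_width and n are
    illustrative choices, not tuned values.
    """
    b = np.linspace(-half_width, half_width, n)
    log_post = -(z - b) ** 2 - lam * np.abs(b)
    w = np.exp(log_post - log_post.max())  # unnormalised posterior weights
    return float(np.sum(b * w) / np.sum(w))

z, lam = 0.3, 1.0                      # weak signal relative to the penalty
map_est = lasso_map(z, lam)            # posterior mode
mean_est = posterior_mean(z, lam)      # posterior mean
print("MAP:", map_est, "posterior mean:", mean_est)
```

With these values the mode sits exactly at zero because `|z|` is below `lam / 2`, while the posterior mean is a small positive number. The posterior itself remains continuous in both cases; only the location of its peak is at zero.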
