Laplace Prior – Why is Laplace Prior Producing Sparse Solutions?

bayesian, laplace-distribution, prior, regression, regularization

I was looking through the literature on regularization, and I often see paragraphs that link L2 regularization with a Gaussian prior, and L1 with a Laplace prior centered at zero.

I know what these priors look like, but I don't understand how they translate to, for example, the weights in a linear model.
With L1, if I understand correctly, we expect sparse solutions, i.e. some weights pushed to exactly zero, while with L2 we get small but non-zero weights.

But why does it happen?

Please comment if I need to provide more information or clarify my line of thinking.

Best Answer

The relation of the Laplace distribution to the median (or the L1 norm) was found by Laplace himself, who showed that with such an error distribution you estimate the median rather than the mean, as you would with the Normal distribution (see Stigler, 1986, or Wikipedia). This means that regression with Laplace-distributed errors estimates the conditional median (as in, e.g., quantile regression), while Normal errors correspond to the OLS estimate of the mean.
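As a one-line sketch of why absolute error yields the median (the calculation that Norton, 1984, works through with calculus): for a location parameter $\mu$, maximizing the Laplace likelihood is the same as minimizing the sum of absolute deviations,

$$
\hat\mu \;=\; \arg\max_\mu \prod_{i=1}^n \frac{1}{2b}\, e^{-|y_i-\mu|/b} \;=\; \arg\min_\mu \sum_{i=1}^n |y_i-\mu| \;=\; \operatorname{median}(y_1,\dots,y_n),
$$

whereas the same step with a Normal likelihood minimizes $\sum_i (y_i-\mu)^2$ and gives the sample mean.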

The priors you asked about were also described by Tibshirani (1996), who noticed that Lasso regression in a Bayesian setting is equivalent to using a Laplace prior on the coefficients. Such a prior is centered around zero (with centered variables) and has heavy tails, so most regression coefficients estimated with it end up being exactly zero. This is clear if you look closely at the picture below: the Laplace distribution has a sharp peak at zero (a greater concentration of probability mass there), while the Normal distribution is more diffuse around zero, so non-zero values carry relatively more mass. That peak is the key to the sparsity: the negative log-density of the Laplace prior is the L1 penalty, which pulls every coefficient toward zero with constant force regardless of its size, so weakly supported coefficients are driven to exactly zero at the MAP estimate; the negative log-density of the Normal prior is the L2 penalty, whose pull is proportional to the coefficient and fades as it approaches zero, shrinking coefficients but never zeroing them. Other possibilities for robust priors are the Cauchy or $t$-distributions (see Gelman et al., 2008).
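To make the equivalence concrete, here is a short sketch in standard Bayesian-regression notation (my notation, not Tibshirani's): with Gaussian errors $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and independent Laplace priors $p(\beta_j) \propto e^{-|\beta_j|/b}$, the negative log-posterior is

$$
-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2 \;+\; \frac{1}{b}\,\lVert \beta \rVert_1 \;+\; \text{const},
$$

so the MAP estimate solves exactly the Lasso problem with penalty $\lambda = \sigma^2/b$. Swapping in a Normal prior, $-\log p(\beta_j) = \beta_j^2/(2\tau^2) + \text{const}$, replaces the L1 term with the ridge penalty $\frac{1}{2\tau^2}\lVert\beta\rVert_2^2$.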

Using such priors you are more prone to end up with many zero-valued coefficients, some moderate-sized and some large (the long tail), while with a Normal prior you get mostly moderate-sized coefficients that are rarely exactly zero but also not far from it.

(Figure: densities of the Laplace and Normal distributions, showing the Laplace peak at zero and its heavier tails; image source: Tibshirani, 1996.)
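For a quick empirical check, here is a minimal sketch with scikit-learn on synthetic data (the data, seed, and penalty strengths below are arbitrary choices for illustration, not from the answer above): the L1-penalized fit (Lasso, the MAP estimate under a Laplace prior) zeroes out most of the irrelevant coefficients, while the L2-penalized fit (ridge, the MAP estimate under a Normal prior) only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 samples, 20 features, only the first 3 coefficients non-zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=1.0, size=100)

# L1 penalty <-> Laplace prior (MAP); L2 penalty <-> Normal prior (MAP).
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
print("Ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
print("Lasso:", np.round(lasso.coef_, 2))  # sparse: most entries are exactly 0
print("Ridge:", np.round(ridge.coef_, 2))  # dense: small but non-zero entries
```

Typically the Lasso fit recovers the three true signals and sets most of the remaining coefficients exactly to zero, while ridge returns twenty small, non-zero values.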


Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Belknap Press of Harvard University Press.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288.

Gelman, A., Jakulin, A., Pittau, M.G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.

Norton, R.M. (1984). The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator. The American Statistician, 38(2): 135-136.