[Math] Bayesian Interpretation for Ridge Regression and the Lasso

linear regression, machine learning, statistics

I'm reading the book "Introduction to Statistical Learning", and in Chapter 6 on "Linear Model Selection and Regularization" there is a short section called "Bayesian Interpretation for Ridge Regression and the Lasso" whose reasoning I haven't been able to follow.

A Bayesian viewpoint for regression assumes that the coefficient vector $\beta$ has some prior distribution, say $p(\beta)$, where $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$. The likelihood of the data can be written as $f(Y|X, \beta)$, where $X = (X_1, X_2, \dots, X_p)$. Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form $$p(\beta|X, Y) \propto f(Y|X,\beta)p(\beta|X) = f(Y|X, \beta)p(\beta),$$
where the proportionality above follows from Bayes’ theorem, and the equality above follows from the assumption that X is fixed.

The authors assume the linear model: $$Y = \beta_0 + X_1\beta_1 + \dots + X_p\beta_p + \epsilon,$$ and suppose that the errors are independent and drawn from a normal distribution. They also assume that $p(\beta) = \prod_{j=1}^p g(\beta_j)$ for some density function $g$.

Starting from those assumptions, the book then states:

It turns out that ridge regression and the lasso follow naturally from two special cases of $g$:

  • If $g$ is a Gaussian distribution with mean zero and standard deviation a function of $\lambda$, then it follows that the posterior mode for $\beta$ (that is, the most likely value for $\beta$, given the data) is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)
  • If $g$ is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of $\lambda$, then it follows that the posterior mode for $\beta$ is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector.)


I don't understand why ridge regression and the lasso follow from these two special cases of $g$, or why those particular prior distributions lead to them. Would anyone please help me understand this? Thank you so much in advance for all your help!

Best Answer

Least squares, the lasso and ridge regression minimize the following objective functions, respectively:

$\min ||y - X \beta||_2^2 $

$\min ||y - X \beta||_2^2 + \lambda ||\beta||_1 $,

$\min ||y - X \beta||_2^2 + \lambda ||\beta||_2^2 $.

Up to this point, these optimization problems have nothing to do with probability distributions (no assumption has been made about the distribution of $y$ or of the parameters).
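For concreteness, here is a minimal numpy sketch of the three objectives written as plain functions of $\beta$ (the argument names and the penalty weight `lam` are purely illustrative):

```python
import numpy as np

def least_squares_obj(beta, X, y):
    # ||y - X beta||_2^2
    return np.sum((y - X @ beta) ** 2)

def lasso_obj(beta, X, y, lam):
    # ||y - X beta||_2^2 + lam * ||beta||_1
    return least_squares_obj(beta, X, y) + lam * np.sum(np.abs(beta))

def ridge_obj(beta, X, y, lam):
    # ||y - X beta||_2^2 + lam * ||beta||_2^2
    return least_squares_obj(beta, X, y) + lam * np.sum(beta ** 2)
```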

However, it is often useful to give the minimizers a probabilistic interpretation, and that is why we assume distributions for $y$ and for the parameters.

Now assume that $y|X,\beta \sim N(X \beta, \sigma^2 I)$. Then the least squares minimizer is the maximum likelihood estimator.
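To see why, take the negative log-likelihood under that normal model (dropping constants that do not depend on $\beta$):
$$-\log f(y \mid X, \beta) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \text{const},$$
so maximizing the likelihood over $\beta$ is the same as minimizing $\|y - X\beta\|_2^2$.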

Further, if we assume $\beta \sim N(0, \tau^2 I)$, then the ridge minimizer is the maximum a posteriori (MAP) estimator, i.e. the posterior mode; and if we instead give $\beta$ a Laplace (double-exponential) prior, then the lasso minimizer is the MAP estimator.
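To make the correspondence explicit (a sketch, writing the Gaussian prior as $\beta \sim N(0, \tau^2 I)$ and the Laplace prior as $g(\beta_j) \propto \exp(-|\beta_j|/b)$): the negative log-posterior is, up to additive constants,
$$-\log p(\beta \mid X, y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const}$$
for the Gaussian prior, and
$$-\log p(\beta \mid X, y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1 + \text{const}$$
for the Laplace prior. Multiplying through by $2\sigma^2$, the first is the ridge objective with $\lambda = \sigma^2/\tau^2$ and the second is the lasso objective with $\lambda = 2\sigma^2/b$. So $\lambda$ is the ratio of the noise variance to the prior scale: a tighter prior (smaller $\tau$ or $b$) means a larger $\lambda$ and more shrinkage.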

In summary, we put distributions on $y$ and $\beta$ in order to give the minimizers a probabilistic interpretation. However, these assumptions do not necessarily hold in reality; they are for interpretation purposes only.
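As a sanity check, here is a small Python sketch that minimizes the negative log-posterior numerically and compares the result with the closed-form ridge solution. The synthetic data and the choices of `n`, `p`, `sigma`, `tau` are my own illustrative assumptions, not part of the answer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumptions: n=100, p=5, sigma=1, tau=0.5)
n, p, sigma, tau = 100, 5, 1.0, 0.5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + sigma * rng.normal(size=n)

# Ridge closed form for ||y - X b||^2 + lam * ||b||^2 with lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Negative log-posterior under y | b ~ N(X b, sigma^2 I) and b ~ N(0, tau^2 I),
# dropping constants that do not depend on b
def neg_log_post(b):
    return (np.sum((y - X @ b) ** 2) / (2 * sigma**2)
            + np.sum(b ** 2) / (2 * tau**2))

beta_map = minimize(neg_log_post, np.zeros(p)).x

# Close to zero: the posterior mode coincides with the ridge estimate
print(np.max(np.abs(beta_ridge - beta_map)))
```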
