[Math] Prove that the ridge regression estimator $\hat{\beta}$ equals the mean of the posterior distribution under a Gaussian prior

bayesian, linear algebra, regression, regularization

I want to prove that the ridge regression estimator is the mean of the posterior distribution under a Gaussian prior.

$$y \sim N(X\beta,\sigma^2I),\quad \text{prior }\beta \sim N(0,\gamma^2 I).$$

$$\hat{\beta} = \left(X^TX + \frac{\sigma^2}{\gamma^2}I\right)^{-1}X^Ty.$$

What I'm trying to show is that $\mu = \hat{\beta}$, where $\mu$ is the mean appearing in $$-\frac{1}{2}(\beta - \mu)^T\Sigma^{-1}(\beta - \mu)$$ and $\Sigma$ is the covariance matrix of the posterior distribution $p(\beta\mid X,y)$.

There is a solution to this question in the last couple of lines on page 3 of http://ssli.ee.washington.edu/courses/ee511/HW/hw3_solns.pdf, but I'm baffled as to how it gets there. (The problem is exercise 3.6.)

Edit: $\mu$ is the mean of the posterior.

Edit2: The last couple of lines of problem 3.6 say "which is the single $\beta$ term in the $p(\beta|y, X)$ equation." What is the single $\beta$ term? This sentence makes no sense to me, and I'm not sure how pointing out that something is the single $\beta$ term is relevant to the proof.

Edit2 continued: For convenience, "$m_b = \frac{1}{\sigma^2}\Sigma_b X^T y$, and

$\beta^T\Sigma_b^{-1}m_b = \frac{1}{\sigma^2}\beta^T X^T y$, which is the single $\beta$ term in the $p(\beta \mid X,y)$ equation." (me: okay, but how is this helpful to the proof?)

Best Answer

Ridge

The ridge estimator is defined as the solution to the following optimisation problem:

$$ \beta_R = \arg\min_\beta \frac{1}{2}\|X\beta -y\|_2^2 + {\kappa\over 2}\|\beta\|_2^2 $$

i.e., (half) the usual OLS loss (first term) plus an L2 regularisation term with tuning "hyper"-parameter $\kappa\ge 0$; for $\kappa=0$ we recover OLS. The problem is convex and differentiable, and setting the gradient with respect to $\beta$ to zero leads to

$$ X^T(X\beta_R -y) + \kappa \beta_R = 0 $$

which is equivalent to $(X^T X + \kappa I)\beta_R = X^Ty$. For $\kappa>0$ the matrix on the LHS is positive definite (easy to prove) and can therefore be inverted, leading to the standard form:

$$\beta_R = (X^T X+\kappa I)^{-1} X^Ty. $$
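As a quick sanity check (not part of the proof), here is a minimal numpy sketch with synthetic data and an arbitrary $\kappa$; all names and values are just for illustration. It verifies the first-order condition above for the closed-form estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
kappa = 2.0  # illustrative value

# Closed-form ridge estimator: (X^T X + kappa I)^{-1} X^T y
beta_R = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

# The first-order condition X^T (X beta_R - y) + kappa * beta_R = 0 should hold
print(np.allclose(X.T @ (X @ beta_R - y) + kappa * beta_R, 0.0))
```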

Bayesian regression

The posterior in $\beta$ is proportional to the likelihood (gaussian) times the prior (gaussian):

$$ p(\beta | y) \propto p(y|\beta) \pi(\beta) $$

The product of two Gaussian densities, viewed as a function of $\beta$, is again proportional to a Gaussian density. So the posterior is Gaussian, with some mean $\beta_G$ and covariance $\Sigma_G$.

I will show that $\beta_G = \beta_R$ in two different ways.

1. Via the MAP

We know the posterior is Gaussian, and for a Gaussian the mean coincides with the mode (the maximiser of the density). Therefore we can write:

$$ \beta_G = \arg\max_\beta L(y;\beta)\pi(\beta) $$

where $L(y;\beta)=p(y|\beta)$ is the likelihood of the data $y$ when the gaussian generating process is parameterised by $\beta$, i.e. $\mathcal N(y;X\beta, \sigma^2I)$. Correspondingly, $\pi(\beta)$ is the prior $\mathcal N(\beta;0,\tau^2 I)$ (the question's $\gamma^2$ is written $\tau^2$ here).

Now the $\arg\max$ does not change under a strictly increasing transformation such as the $\log$, and additive constants are irrelevant. Therefore, we can write

$$\beta_G = \arg\max_\beta \log L(y;\beta) + \log \pi(\beta)$$

removing additive constants, this leads to

$$\beta_G = \arg\max_\beta -{1\over 2\sigma^2}(y-X\beta)^T(y-X\beta) - {1\over 2\tau^2} \beta^T\beta $$

Changing sign turns this into a minimisation problem (see how we're getting close to ridge?). Expanding $(y-X\beta)^T(y-X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$, then dropping the additive constant $y^Ty/(2\sigma^2)$ and the overall factor $1/2$, you get:

$$ \beta_G = \arg\min_\beta {1\over \sigma^2}\left(\beta^T X^TX\beta - 2\beta^T X^T y\right) + {1\over \tau^2} \beta^T\beta $$

massage a bit more (multiply objective by $\sigma^2$)

$$ \beta_G = \arg\min_\beta \beta^T\left(X^TX+{\sigma^2\over \tau^2}I\right)\beta - 2\beta^TX^T y $$

setting the gradient in $\beta$ to zero leads to

$$\left(X^TX + {\sigma^2\over \tau^2}I\right)\beta_G = X^T y $$

which concludes the proof (just identify $\kappa \leftrightarrow \sigma^2/\tau^2$).
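As a numerical illustration (a sketch only, with synthetic data and made-up values for $\sigma^2$ and $\tau^2$), one can check that minimising the negative log-posterior above recovers the closed-form ridge solution with $\kappa = \sigma^2/\tau^2$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)
sigma2, tau2 = 0.25, 1.0  # noise and prior variances (illustrative values)

# Negative log-posterior up to additive constants (the MAP objective above)
def neg_log_post(beta):
    resid = y - X @ beta
    return resid @ resid / (2 * sigma2) + beta @ beta / (2 * tau2)

beta_map = minimize(neg_log_post, np.zeros(p)).x
beta_ridge = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)
print(np.allclose(beta_map, beta_ridge, atol=1e-6))
```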

2. Quadratic form matching

You have a product of two gaussians leading to a Gaussian. We can ignore the proportionality constants and the $\exp$, so you are effectively matching the quadratic forms in the exponents (they agree up to terms that do not involve $\beta$):

  • LHS: $(\beta-\beta_G)^T\Sigma_G^{-1}(\beta-\beta_G)$
  • RHS: ${1\over\sigma^2}(y-X\beta)^T(y-X\beta) + {1\over\tau^2}\beta^T\beta $

Matching quadratic terms in $\beta$

  • LHS: $\beta^T \Sigma_G^{-1} \beta$
  • RHS: $\beta^T(X^TX/\sigma^2 + I/\tau^2)\beta$

therefore

$$\Sigma_G^{-1} = (X^TX/\sigma^2 + I/\tau^2) $$

Linear term in $\beta$

  • LHS: $-2\beta^T \Sigma_G^{-1}\beta_G$
  • RHS: $-2\beta^T X^T y/\sigma^2$

therefore (note that this linear term is the "single $\beta$ term" the linked solution refers to)

$$ \beta_G = \Sigma_G X^T y / \sigma^2 = \left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1} X^T y, $$

which is again the ridge estimator with $\kappa = \sigma^2/\tau^2$.
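Again, purely as an illustrative sketch (synthetic data, arbitrary variances, names chosen for readability), the posterior mean computed from $\Sigma_G^{-1}$ above coincides numerically with the ridge estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)
sigma2, tau2 = 0.5, 2.0  # illustrative noise and prior variances

# Posterior precision and mean obtained from quadratic-form matching
Sigma_G_inv = X.T @ X / sigma2 + np.eye(p) / tau2
beta_G = np.linalg.solve(Sigma_G_inv, X.T @ y / sigma2)

# Ridge estimator with kappa = sigma^2 / tau^2
beta_R = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ y)
print(np.allclose(beta_G, beta_R))
```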

Side remark

The first approach corresponds to taking the MAP estimator (maximum a posteriori). MAP estimators can often be read as regularised maximum-likelihood estimators, with the log-prior playing the role of the penalty. In this case the MLE is the OLS estimator, the penalty is an L2 norm, and this corresponds to a MAP with a Gaussian likelihood and a Gaussian prior.

You have a similar thing with the Lasso, for example, which corresponds to the MAP of a Gaussian likelihood with a Laplace prior.