Solved – Deriving the Ridge Regression $\boldsymbol{\beta}\mid \mathbf{y}$ distribution

Tags: bayesian, mathematical-statistics, ridge-regression, self-study, statistical-learning

Apparently the estimate $\hat{\boldsymbol{\beta}}$ for ridge regression comes up as the mean or mode of the posterior distribution given by $f_{\boldsymbol{\beta}\mid \mathbf{y}}$.

This is the closest that I've found to a clear statement of what the distribution of $\mathbf{y} \mid \boldsymbol{\beta}$ and $\boldsymbol{\beta}$ should be. We have
$$\begin{align*}
&\mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}_{N}\left(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I} \right) \text{ (usual linear model assumptions)} \\
&\boldsymbol{\beta} \sim \mathcal{N}_{p}\left(\mathbf{0}, \dfrac{\sigma^2}{\lambda}\mathbf{I} \right)
\end{align*}$$
Using Bayes' theorem,
$$\begin{align*}
f_{\boldsymbol{\beta}\mid \mathbf{y}}(\mathbf{b}\mid \mathbf{t}) &\propto f_{ \mathbf{y} \mid \boldsymbol{\beta}}(\mathbf{t}\mid \mathbf{b})\pi_{\boldsymbol{\beta}}(\mathbf{b}) \\
&\propto \exp\left\{\dfrac{-1}{2}\left[\dfrac{\left(\mathbf{t}-\mathbf{X}\mathbf{b}\right)^{T}\left(\mathbf{t}-\mathbf{X}\mathbf{b}\right)+\lambda\mathbf{b}^{T}\mathbf{b} }{\sigma^2} \right] \right\}
\end{align*}$$
using the density given here. We can assume that $\sigma^2 > 0$ (and $\lambda > 0$), so both covariance matrices are positive definite and thus both densities exist.
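
For what it's worth, the point estimate itself I can get: maximizing this exponent over $\mathbf{b}$ is the same as minimizing $\left(\mathbf{t}-\mathbf{X}\mathbf{b}\right)^{T}\left(\mathbf{t}-\mathbf{X}\mathbf{b}\right)+\lambda\mathbf{b}^{T}\mathbf{b}$, and setting the gradient with respect to $\mathbf{b}$ to zero gives (assuming $\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}$ is invertible, which it is for $\lambda>0$)
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{t},$$
the usual ridge estimator, so the mode part makes sense.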

Apparently this is supposed to be normally distributed, but I don't see why. In particular, I'm not sure how to get the result at the link above.

Best Answer

See if you can derive it from this more general result:

If $\mathbf{y}\sim \text{N}(\mathbf{X}\mathbf\beta,\mathbf{R})$ and $\mathbf\beta \sim \text{N}(\mathbf{a},\mathbf{B})$ then the posterior is $\mathbf\beta|\mathbf{y} \sim \text{N}(\mathbf\mu, \mathbf\Sigma)$ where $$\mathbf\mu = \mathbf\Sigma\left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf{B}^{-1}\mathbf{a}\right)\quad\text{and}\quad\mathbf\Sigma = \left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X} + \mathbf{B}^{-1}\right)^{-1}$$
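
(For the ridge setup in the question, $\mathbf{R}=\sigma^2\mathbf{I}$, $\mathbf{a}=\mathbf{0}$, and $\mathbf{B}=\dfrac{\sigma^2}{\lambda}\mathbf{I}$, and these formulas reduce by direct substitution to
$$\mathbf\Sigma = \sigma^2\left(\mathbf{X}^\intercal\mathbf{X}+\lambda\mathbf{I}\right)^{-1} \quad\text{and}\quad \mathbf\mu = \left(\mathbf{X}^\intercal\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^\intercal\mathbf{y},$$
so the posterior mean, which is also the mode, is exactly the ridge estimator.)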

To identify the kernel of the posterior distribution, we only need to keep track of the terms involving $\mathbf\beta$.

$$\begin{align*}
p(\mathbf\beta|\mathbf{y}) &\propto p(\mathbf{y}|\mathbf\beta)\,p(\mathbf\beta)\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[(\mathbf{y}-\mathbf{X}\mathbf\beta)^\intercal\mathbf{R}^{-1}(\mathbf{y}-\mathbf{X}\mathbf\beta) + (\mathbf\beta-\mathbf{a})^\intercal\mathbf{B}^{-1}(\mathbf\beta-\mathbf{a})\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{y} - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf\beta^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf\beta^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta \right.\right.\\
&\qquad \left.\left. {}+ \mathbf\beta^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta + \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf{a}\right]\right\}
\end{align*}$$

Dropping the terms that do not involve $\mathbf\beta$,

$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf\beta^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf\beta^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta + \mathbf\beta^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta \right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf\beta^\intercal\underbrace{(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}+\mathbf{B}^{-1})}_{\mathbf\Sigma^{-1}}\mathbf\beta - \mathbf\beta^\intercal(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}) - (\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a})^\intercal\mathbf\beta\right]\right\}
\end{align*}$$

Inserting the identity $\mathbf{I} = \mathbf\Sigma^{-1}\mathbf\Sigma$ in front of $\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}$, so that this vector becomes $\mathbf\Sigma^{-1}\mathbf\mu$,

$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf\beta^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf\Sigma^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta\right]\right\}
\end{align*}$$

and multiplying by the constant $\exp\left\{-\tfrac{1}{2}\mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right\}$, which does not involve $\mathbf\beta$ and so does not change the proportionality,

$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf\beta^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf\Sigma^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta + \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}(\mathbf\beta - \mathbf\mu)^\intercal\mathbf\Sigma^{-1}(\mathbf\beta - \mathbf\mu)\right\}
\end{align*}$$

Therefore the posterior is normal with mean $\mathbf\mu$ and variance-covariance matrix $\mathbf\Sigma$.
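
If you want a quick numerical sanity check, here is a minimal NumPy sketch (my own illustration; `sigma2` and `lam` are just placeholder names for $\sigma^2$ and $\lambda$) that plugs the ridge setup from the question into the $\mathbf\mu$ and $\mathbf\Sigma$ above and compares them with the closed-form ridge expressions:

```python
# Minimal sketch: verify the general posterior formulas on the ridge special case
# R = sigma^2 I, a = 0, B = (sigma^2 / lambda) I.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
sigma2, lam = 2.0, 0.7

# Simulate data from the usual linear model y | beta ~ N(X beta, sigma^2 I)
X = rng.normal(size=(N, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# General result: Sigma = (X^T R^{-1} X + B^{-1})^{-1},  mu = Sigma (X^T R^{-1} y + B^{-1} a)
R_inv = np.eye(N) / sigma2
B_inv = (lam / sigma2) * np.eye(p)
a = np.zeros(p)
Sigma = np.linalg.inv(X.T @ R_inv @ X + B_inv)
mu = Sigma @ (X.T @ R_inv @ y + B_inv @ a)

# Closed-form ridge quantities: (X^T X + lambda I)^{-1} X^T y  and  sigma^2 (X^T X + lambda I)^{-1}
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
Sigma_ridge = sigma2 * np.linalg.inv(X.T @ X + lam * np.eye(p))

print(np.allclose(mu, beta_ridge))       # True: posterior mean equals the ridge estimate
print(np.allclose(Sigma, Sigma_ridge))   # True: posterior covariance is sigma^2 (X^T X + lambda I)^{-1}
```

Both checks should print `True`, confirming that for this prior the posterior mean coincides with the ridge estimate and that $\mathbf\Sigma = \sigma^2(\mathbf{X}^\intercal\mathbf{X}+\lambda\mathbf{I})^{-1}$.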