The answer to both 1 and 2 is no, but care is needed in interpreting the existence theorem.
Variance of Ridge Estimator
Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard equations 4.2-4.5, the risk (the expected squared $L^2$ norm of the estimation error) is
$$
\begin{align*}
E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\
& = \gamma_1 (k) + \gamma_2(k) \\
& = R(k)
\end{align*}
$$
where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ is the total variance of $\hat{\beta^*}$ (the sum of the variances of its components), while $\gamma_2$ is the squared bias, i.e. the squared distance from $E \hat{\beta^*}$ to $\beta$.
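To make the decomposition concrete, here is a minimal numerical sketch (the design, $\beta$, $\sigma$, and $k$ below are arbitrary choices, not from H&K) that compares the closed-form $R(k) = \gamma_1(k) + \gamma_2(k)$ to a Monte Carlo estimate of $E\left\| \hat{\beta^*} - \beta \right\|^2$; the two numbers should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: any fixed design, true beta, noise level, and penalty work here.
n, p, sigma, k = 200, 4, 1.0, 0.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0])

XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)                  # eigenvalues lambda_1, ..., lambda_p

# Closed-form risk: gamma_1 (total variance) + gamma_2 (squared bias).
A_inv = np.linalg.inv(XtX + k * np.eye(p))
gamma1 = sigma**2 * np.sum(lam / (lam + k)**2)
gamma2 = k**2 * beta @ (A_inv @ A_inv) @ beta
R_closed = gamma1 + gamma2

# Monte Carlo estimate of E||beta_hat* - beta||^2 over fresh noise draws.
errs = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = A_inv @ X.T @ y                 # ridge estimate with penalty k
    errs.append(np.sum((beta_hat - beta)**2))

print(f"closed form R(k) = {R_closed:.4f}, Monte Carlo = {np.mean(errs):.4f}")
```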
If $X^T X = \mathbf{I}_p$, then
$$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$
Let
$$R^\prime (k) = 2\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3}$$ be the derivative of the risk w/r/t $k$.
Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$ (where $R(0) = \sigma^2 \sum_{j=1}^p 1/\lambda_j$), and that as $X^T X$ becomes more ill-conditioned, with its smallest eigenvalue approaching zero, $\lim_{k \rightarrow 0^+} R^\prime (k) = -2\sigma^2 \sum_{j=1}^p \lambda_j^{-2}$ approaches $- \infty$.
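In the orthogonal case, setting $R^\prime(k) = 0$ gives the minimizer explicitly: $k^* = p\sigma^2 / \beta^T \beta$. The sketch below (with arbitrary illustrative values of $p$, $\sigma$, and $\beta^T \beta$) traces $R(k)$ on a grid and checks that the minimum sits at $k^*$ and lies strictly below the OLS risk $R(0) = p\sigma^2$.

```python
import numpy as np

# Orthogonal design: X^T X = I_p, so R(k) = (p*sigma^2 + k^2 * b) / (1 + k)^2
# with b = beta^T beta.  Values below are arbitrary, for illustration only.
p, sigma = 5, 1.0
b = 2.0                                     # beta^T beta

def R(k):
    return (p * sigma**2 + k**2 * b) / (1.0 + k)**2

ks = np.linspace(0.0, 10.0, 100001)
risks = R(ks)

k_grid = ks[np.argmin(risks)]
k_star = p * sigma**2 / b                   # stationary point of R'(k)

print(f"R(0)        = {R(0.0):.4f}")        # OLS risk, p * sigma^2
print(f"grid argmin = {k_grid:.3f}, analytic k* = {k_star:.3f}")
print(f"R(k*)       = {R(k_star):.4f}")     # strictly below R(0)
```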
Comment
There appears to be a paradox here, in that if $p=1$ and $X$ is constant, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that, for each fixed $\beta^T \beta$, there exists a risk-reducing value of $k$. But for any fixed $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show that a fixed-$k$ ridge estimate dominates the unbiased estimate, and hence does not contradict its admissibility.
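This resolution is easy to see numerically in the orthogonal case: fix any $k > 0$ and let $\beta^T \beta$ grow, and the ridge risk eventually exceeds the OLS risk $p\sigma^2$. A minimal sketch with illustrative values:

```python
import numpy as np

# Orthogonal design, fixed penalty k: R(k) = (p*sigma^2 + k^2*b)/(1+k)^2 with b = beta^T beta.
p, sigma, k = 5, 1.0, 1.0
ols_risk = p * sigma**2                     # risk at k = 0, independent of beta

for b in [1.0, 10.0, 100.0, 1000.0]:
    ridge_risk = (p * sigma**2 + k**2 * b) / (1.0 + k)**2
    print(f"beta'beta = {b:7.1f}: ridge risk = {ridge_risk:9.2f}, OLS risk = {ols_risk:.2f}")
```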
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly-singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and that the invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
See if you can derive it from this more general result:
If $\mathbf{y}\sim \text{N}(\mathbf{X}\mathbf\beta,\mathbf{R})$ and $\mathbf\beta \sim \text{N}(\mathbf{a},\mathbf{B})$ then the posterior is $\mathbf\beta|\mathbf{y} \sim \text{N}(\mathbf\mu, \mathbf\Sigma)$ where $$\mathbf\mu = \mathbf\Sigma\left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf{B}^{-1}\mathbf{a}\right)\quad\text{and}\quad\mathbf\Sigma = \left(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X} + \mathbf{B}^{-1}\right)^{-1}$$
In order to identify the kernel of the distribution of the posterior, we will only keep track of terms involving $\mathbf\beta$.
$$\begin{align*}
p(\mathbf\beta|\mathbf{y}) &\propto p(\mathbf{y}|\mathbf\beta)p(\mathbf\beta)\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[(\mathbf{y}-\mathbf{X}\mathbf\beta)^\intercal\mathbf{R}^{-1}(\mathbf{y}-\mathbf{X}\mathbf\beta) + (\mathbf\beta-\mathbf{a})^\intercal\mathbf{B}^{-1}(\mathbf\beta-\mathbf{a})\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{y} - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}\right.\right.\\
&\qquad + \left.\vphantom{\dfrac{1}{2}}\left.\mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta + \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta + \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf{a}\right]\right\}
\end{align*}$$ dropping terms not involving $\mathbf\beta$
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ - \mathbf{y}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y} + \mathbf{\beta}^\intercal\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}\mathbf\beta\right.\right. \\
&\qquad+ \left.\vphantom{\dfrac{1}{2}}\left.\mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf\beta - \mathbf{\beta}^\intercal\mathbf{B}^{-1}\mathbf{a} - \mathbf{a}^\intercal\mathbf{B}^{-1}\mathbf\beta \right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\underbrace{(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{X}+\mathbf{B}^{-1})}_{\mathbf\Sigma^{-1}}\mathbf\beta - \mathbf{\beta}^\intercal(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}) \right.\right.\\
&\qquad - \left.\left.\vphantom{\underbrace{\dfrac{1}{2}}_{\mathbf\Sigma^{-1}}}(\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a})^\intercal\mathbf\beta\right]\right\}
\end{align*}$$ Inserting the identity $\mathbf{I} = \mathbf{\Sigma}^{-1}\mathbf{\Sigma}$ in front of $\mathbf{X}^\intercal\mathbf{R}^{-1}\mathbf{y}+\mathbf{B}^{-1}\mathbf{a}$, so that this vector becomes $\mathbf\Sigma^{-1}\mathbf\mu$
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{\Sigma}^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta\right]\right\}
\end{align*}$$ multiplying by $\exp\left\{-\frac{1}{2}\mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right\}$, which does not involve $\mathbf\beta$ and so only changes the constant of proportionality
$$\begin{align*}
&\propto \exp\left\{ -\dfrac{1}{2}\left[ \mathbf{\beta}^\intercal\mathbf\Sigma^{-1}\mathbf\beta - \mathbf\beta^\intercal\mathbf{\Sigma}^{-1}\mathbf\mu - \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\beta + \mathbf\mu^\intercal\mathbf\Sigma^{-1}\mathbf\mu\right]\right\}\\
&\propto \exp\left\{ -\dfrac{1}{2}(\mathbf\beta - \mathbf\mu)^\intercal\mathbf\Sigma^{-1}(\mathbf\beta - \mathbf\mu)\right\}
\end{align*}$$ Therefore the posterior is normal with mean $\mathbf\mu$ and variance-covariance matrix $\mathbf\Sigma$.
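As a quick check of the result, and of the ridge connection posed above: taking $\mathbf{R} = \sigma^2\mathbf{I}$, $\mathbf{a} = \mathbf{0}$, and $\mathbf{B} = \tau^2\mathbf{I}$, the posterior mean reduces to the ridge estimate with $k = \sigma^2/\tau^2$. The sketch below verifies this numerically with arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data and prior; any choices work for checking the algebra.
n, p, sigma, tau = 50, 3, 0.5, 2.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.0, -1.0])
y = X @ beta_true + sigma * rng.normal(size=n)

R = sigma**2 * np.eye(n)                    # likelihood covariance
a = np.zeros(p)                             # prior mean
B = tau**2 * np.eye(p)                      # prior covariance

# Posterior from the general result.
Sigma = np.linalg.inv(X.T @ np.linalg.inv(R) @ X + np.linalg.inv(B))
mu = Sigma @ (X.T @ np.linalg.inv(R) @ y + np.linalg.inv(B) @ a)

# Ridge estimate with k = sigma^2 / tau^2.
k = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print(np.allclose(mu, beta_ridge))          # True: posterior mean == ridge estimate
```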
Best Answer
What the sklearn documentation describes is a regression model with an additional regularization parameter for the coefficients. The model is
$$\begin{align} y &\sim \mathcal{N}(\mu, \alpha^{-1}) \\ \mu &= X\omega \\ \omega &\sim \mathcal{N}(0, \lambda^{-1}\mathbf{I}_p) \\ \alpha &\sim \mathcal{G}(\alpha_1, \alpha_2) \\ \lambda &\sim \mathcal{G}(\lambda_1, \lambda_2) \end{align}$$
So $y$ follows a normal distribution (the likelihood) parametrized by mean $\mu = X\omega$ and variance $\alpha^{-1}$. We choose Gamma priors for the noise precision $\alpha$ and the regularizing parameter $\lambda$; these priors have hyperparameters $\alpha_1, \alpha_2, \lambda_1, \lambda_2$. The regression parameters $\omega$ have independent Gaussian priors with mean $0$ and variance $\lambda^{-1}$, so $\lambda$ serves as a regularization parameter (it is a precision parameter, so the larger $\lambda$ is, the more concentrated around zero the $\omega$ values are assumed to be a priori).
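For concreteness, here is roughly how those hyperparameters map onto scikit-learn's estimator; the data are synthetic and the Gamma hyperparameter values shown are just the library defaults, kept explicit for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic data for illustration.
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=200)

# alpha_1, alpha_2 are the Gamma hyperparameters for the noise precision alpha;
# lambda_1, lambda_2 are those for the weight precision lambda.
model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)

print("estimated noise precision alpha:", model.alpha_)    # roughly 1 / 0.5**2 = 4 here
print("estimated weight precision lambda:", model.lambda_)
print("coefficients:", model.coef_)
```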