Solved – Ridge Regression: how to show squared bias increases as $\lambda$ increases

bias-variance tradeoff, regression, regularization, ridge regression, variance

I am using ridge regression to estimate the coefficients of the true model $y = X\beta + \epsilon$, under the standard assumptions $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$. The ridge estimator of $\beta$ is $$\hat{\beta}^{\mathrm{Ridge}} = (X^\top X + \lambda I )^{-1} X^\top y.$$

Assume we have a fixed test point $x_0$. I have already shown that the variance of the prediction $$\hat{f}(x_0) = x_0^\top (X^\top X + \lambda I)^{-1} X^\top y$$ decreases as $\lambda$ increases.
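To make this concrete, here is a minimal numerical sketch (not part of the original question; it assumes NumPy, an arbitrary random design $X$, a random test point $x_0$, and $\sigma^2 = 1$). It evaluates $\mathrm{Var}(\hat{f}(x_0)) = \sigma^2\, x_0^\top (X^\top X + \lambda I)^{-1} X^\top X\, (X^\top X + \lambda I)^{-1} x_0$ on a grid of $\lambda$ values and checks that it is decreasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))       # illustrative design matrix
x0 = rng.normal(size=p)           # fixed test point
sigma2 = 1.0                      # noise variance

def variance_at_x0(lam):
    # Var(f_hat(x0)) = sigma^2 * x0' A X'X A x0, with A = (X'X + lam I)^-1
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * x0 @ A @ X.T @ X @ A @ x0

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
variances = [variance_at_x0(lam) for lam in lams]
print(variances)                  # decreasing in lambda
assert all(a >= b for a, b in zip(variances, variances[1:]))
```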

Now I want to show that the squared bias of this prediction steadily increases as $\lambda$ increases.

I thought of using the bias-variance decomposition, but it does not work, since the decomposition only tells us
$$\mathrm{Error}(x_0) = \text{Irreducible Error} + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)).$$
To conclude that decreasing variance forces increasing bias, we would need $\mathrm{Error}(x_0)$ to stay constant as $\lambda$ changes, which is not the case.

So, how can I show that the squared bias of the ridge prediction at the test point steadily increases with increasing $\lambda$?

Best Answer

I do not know whether you are still interested in this issue, but I think it is useful for your problem to look at the limiting behaviour of the estimator's mean squared error as the penalty parameter goes to infinity.

Let us write $\hat{\beta}_{r} = (X^\top X + \lambda I )^{-1} X^\top y$ for the ridge estimator and $\hat{\beta} = (X^\top X)^{-1} X^\top y$ for the OLS estimator (which is unbiased, hence $E(\hat{\beta}) = \beta$). Now, if we define $K = (X^\top X + \lambda I )^{-1} X^\top X$, we can verify that $\hat{\beta}_{r} = K \hat{\beta}$, so $K$ transforms the OLS estimator into the ridge one.
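A quick numerical sanity check of this identity (a sketch, not from the original answer; it assumes NumPy and arbitrary illustrative $X$, $y$, $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 4, 2.5            # illustrative sizes and penalty
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
XtX = X.T @ X

beta_ols = np.linalg.solve(XtX, X.T @ y)                      # (X'X)^-1 X'y
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)  # ridge estimator
K = np.linalg.solve(XtX + lam * np.eye(p), XtX)               # (X'X + lam I)^-1 X'X

assert np.allclose(beta_ridge, K @ beta_ols)                  # beta_r = K beta_ols
```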

Then, since $E(\hat{\beta}_{r}) = K E(\hat{\beta}) = K\beta$, the bias is $\mathrm{Bias}(\hat{\beta}_{r}) = (K - I)\beta$, and it can be shown that (see e.g. Hoerl and Kennard, 1970):

$$ \begin{aligned} MSE(\hat{\beta}_{r}) &= E[(\hat{\beta}_{r} - \beta)^\top (\hat{\beta}_{r} - \beta)] = \mathrm{Var}(\hat{\beta}_{r}) + [\mathrm{Bias}(\hat{\beta}_{r})]^2 \\ &= \sigma^{2}\,\mathrm{tr}\{K (X^{\top} X)^{-1}K^{\top}\} + \beta^{\top}(K - I)^{\top}(K - I)\beta, \\ \mathrm{Var}(\hat{\beta}_{r}) &= \sigma^{2}\,\mathrm{tr}\{K (X^{\top} X)^{-1}K^{\top}\}, \\ [\mathrm{Bias}(\hat{\beta}_{r})]^2 &= \beta^{\top}(K - I)^{\top}(K - I)\beta. \end{aligned} $$

(Here $\mathrm{Var}(\hat{\beta}_{r})$ denotes the total variance, i.e. the trace of the covariance matrix, and $[\mathrm{Bias}(\hat{\beta}_{r})]^2$ the squared norm of the bias vector.)
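Evaluating the two terms of this decomposition on a grid of $\lambda$ values makes the tradeoff visible: the variance term shrinks while the squared-bias term grows. A minimal numerical sketch (assuming NumPy and illustrative $X$, $\beta$, $\sigma^2$ of my own choosing, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)         # illustrative "true" coefficients
sigma2 = 1.0
XtX = X.T @ X
I = np.eye(p)

def terms(lam):
    # K = (X'X + lam I)^-1 X'X, as defined above
    K = np.linalg.solve(XtX + lam * I, XtX)
    var = sigma2 * np.trace(K @ np.linalg.inv(XtX) @ K.T)   # total variance
    bias2 = beta @ (K - I).T @ (K - I) @ beta               # squared bias
    return var, bias2

for lam in [0.0, 0.5, 5.0, 50.0, 500.0]:
    var, bias2 = terms(lam)
    print(f"lambda={lam:6.1f}  var={var:8.4f}  bias^2={bias2:8.4f}")
# var decreases with lambda; bias^2 increases toward beta @ beta
```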

From the above, noting that $K \rightarrow 0$ as $\lambda \rightarrow \infty$, we can compute $$ \lim_{\lambda \rightarrow \infty} MSE(\hat{\beta}_{r}) = \beta^\top \beta, $$

which is the squared bias of the estimator that is identically zero: in the limit the variance vanishes (as you pointed out) and the ridge estimator shrinks every coefficient to zero. I hope this helps a bit (and that the notation is clear enough).
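As a complement (this step is not in the original answer, just the standard eigendecomposition argument), the monotonicity asked about in the question can be read off directly. Writing $X^\top X = U D U^\top$ with $D = \mathrm{diag}(d_1, \dots, d_p)$ and $\tilde{\beta} = U^\top \beta$, the two terms above become
$$ \mathrm{Var}(\hat{\beta}_{r}) = \sigma^2 \sum_{i=1}^{p} \frac{d_i}{(d_i + \lambda)^2}, \qquad [\mathrm{Bias}(\hat{\beta}_{r})]^2 = \sum_{i=1}^{p} \frac{\lambda^2}{(d_i + \lambda)^2}\, \tilde{\beta}_i^2. $$
Each bias summand equals $\tilde{\beta}_i^2 / (1 + d_i/\lambda)^2$, which is increasing in $\lambda$ and tends to $\tilde{\beta}_i^2$, so the squared bias increases monotonically to $\beta^\top \beta$, while each variance summand decreases to $0$.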