Solved – Ridge Regression: how to show squared bias increases as $\lambda$ increases

bias-variance tradeoff, regression, regularization, ridge regression, variance

I am using ridge regression to estimate the coefficients of the true model $y = X\beta + \epsilon$, under the standard assumptions $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$. The ridge estimator of $\beta$ is $$\hat{\beta}^{\mathrm{Ridge}} = (X^\top X + \lambda I )^{-1} X^\top y.$$

Assume we have a fixed test point $x_0$. I have already shown that the variance of the prediction $$\hat{f}(x_0) = x_0^\top (X^\top X + \lambda I)^{-1} X^\top y$$ decreases as $\lambda$ increases.
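To make this concrete, here is a minimal numerical sketch (not part of the original question; it assumes NumPy, an arbitrary random design $X$, a random test point $x_0$, and $\sigma^2 = 1$). It evaluates $\mathrm{Var}(\hat{f}(x_0)) = \sigma^2\, x_0^\top (X^\top X + \lambda I)^{-1} X^\top X\, (X^\top X + \lambda I)^{-1} x_0$ on a grid of $\lambda$ values and checks that it is decreasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))       # illustrative design matrix
x0 = rng.normal(size=p)           # fixed test point
sigma2 = 1.0                      # noise variance

def variance_at_x0(lam):
    # Var(f_hat(x0)) = sigma^2 * x0' A X'X A x0, with A = (X'X + lam I)^-1
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * x0 @ A @ X.T @ X @ A @ x0

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
variances = [variance_at_x0(lam) for lam in lams]
print(variances)                  # decreasing in lambda
assert all(a >= b for a, b in zip(variances, variances[1:]))
```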

Now I want to show that the squared bias of this prediction steadily increases as $\lambda$ increases.

I thought of using the bias-variance decomposition, but it does not work, since the decomposition only tells us
$$\mathrm{Error}(x_0) = \text{Irreducible Error} + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)).$$
To conclude that decreasing variance forces increasing bias, we would need $\mathrm{Error}(x_0)$ to stay constant as $\lambda$ changes, which is not the case.

So, how can I show that the squared bias of the ridge prediction at the test point steadily increases with increasing $\lambda$?

Best Answer

I do not know whether you are still interested in this issue, but I think it is useful for your problem to look at the limiting behaviour of the estimator's mean squared error as the penalty parameter goes to infinity.

Let us write $\hat{\beta}_{r} = (X^\top X + \lambda I )^{-1} X^\top y$ for the ridge estimator and $\hat{\beta} = (X^\top X)^{-1} X^\top y$ for the OLS estimator (which is unbiased, hence $E(\hat{\beta}) = \beta$). Now, if we define $K = (X^\top X + \lambda I )^{-1} X^\top X$, we can verify that $\hat{\beta}_{r} = K \hat{\beta}$, so $K$ transforms the OLS estimator into the ridge one.
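A quick numerical sanity check of this identity (a sketch, not from the original answer; it assumes NumPy and arbitrary illustrative $X$, $y$, $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 4, 2.5            # illustrative sizes and penalty
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
XtX = X.T @ X

beta_ols = np.linalg.solve(XtX, X.T @ y)                      # (X'X)^-1 X'y
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)  # ridge estimator
K = np.linalg.solve(XtX + lam * np.eye(p), XtX)               # (X'X + lam I)^-1 X'X

assert np.allclose(beta_ridge, K @ beta_ols)                  # beta_r = K beta_ols
```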

Then, since $E(\hat{\beta}_{r}) = K E(\hat{\beta}) = K\beta$, the bias is $\mathrm{Bias}(\hat{\beta}_{r}) = (K - I)\beta$, and it can be shown that (see e.g. Hoerl and Kennard, 1970):

$$ \begin{aligned} MSE(\hat{\beta}_{r}) &= E[(\hat{\beta}_{r} - \beta)^\top (\hat{\beta}_{r} - \beta)] = \mathrm{Var}(\hat{\beta}_{r}) + [\mathrm{Bias}(\hat{\beta}_{r})]^2 \\ &= \sigma^{2}\,\mathrm{tr}\{K (X^{\top} X)^{-1}K^{\top}\} + \beta^{\top}(K - I)^{\top}(K - I)\beta, \\ \mathrm{Var}(\hat{\beta}_{r}) &= \sigma^{2}\,\mathrm{tr}\{K (X^{\top} X)^{-1}K^{\top}\}, \\ [\mathrm{Bias}(\hat{\beta}_{r})]^2 &= \beta^{\top}(K - I)^{\top}(K - I)\beta. \end{aligned} $$

(Here $\mathrm{Var}(\hat{\beta}_{r})$ denotes the total variance, i.e. the trace of the covariance matrix, and $[\mathrm{Bias}(\hat{\beta}_{r})]^2$ the squared norm of the bias vector.)
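Evaluating the two terms of this decomposition on a grid of $\lambda$ values makes the tradeoff visible: the variance term shrinks while the squared-bias term grows. A minimal numerical sketch (assuming NumPy and illustrative $X$, $\beta$, $\sigma^2$ of my own choosing, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)         # illustrative "true" coefficients
sigma2 = 1.0
XtX = X.T @ X
I = np.eye(p)

def terms(lam):
    # K = (X'X + lam I)^-1 X'X, as defined above
    K = np.linalg.solve(XtX + lam * I, XtX)
    var = sigma2 * np.trace(K @ np.linalg.inv(XtX) @ K.T)   # total variance
    bias2 = beta @ (K - I).T @ (K - I) @ beta               # squared bias
    return var, bias2

for lam in [0.0, 0.5, 5.0, 50.0, 500.0]:
    var, bias2 = terms(lam)
    print(f"lambda={lam:6.1f}  var={var:8.4f}  bias^2={bias2:8.4f}")
# var decreases with lambda; bias^2 increases toward beta @ beta
```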

From the above, noting that $K \rightarrow 0$ as $\lambda \rightarrow \infty$, we can compute $$ \lim_{\lambda \rightarrow \infty} MSE(\hat{\beta}_{r}) = \beta^\top \beta, $$

which is the squared bias of the estimator that is identically zero: in the limit the variance vanishes (as you pointed out) and the ridge estimator shrinks every coefficient to zero. I hope this helps a bit (and that the notation is clear enough).
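As a complement (this step is not in the original answer, just the standard eigendecomposition argument), the monotonicity asked about in the question can be read off directly. Writing $X^\top X = U D U^\top$ with $D = \mathrm{diag}(d_1, \dots, d_p)$ and $\tilde{\beta} = U^\top \beta$, the two terms above become
$$ \mathrm{Var}(\hat{\beta}_{r}) = \sigma^2 \sum_{i=1}^{p} \frac{d_i}{(d_i + \lambda)^2}, \qquad [\mathrm{Bias}(\hat{\beta}_{r})]^2 = \sum_{i=1}^{p} \frac{\lambda^2}{(d_i + \lambda)^2}\, \tilde{\beta}_i^2. $$
Each bias summand equals $\tilde{\beta}_i^2 / (1 + d_i/\lambda)^2$, which is increasing in $\lambda$ and tends to $\tilde{\beta}_i^2$, so the squared bias increases monotonically to $\beta^\top \beta$, while each variance summand decreases to $0$.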