Ridge Regression – Why Is Ridge Regression Not Scale-Invariant?

regression, ridge regression

In The Elements of Statistical Learning, Chapter 3, we learn that linear regression is scale-invariant, since the scaling matrix applied to the coefficients eventually cancels. Why doesn't the same hold for ridge regression? The ridge coefficients have the closed form
$$
\beta = (X^{T}X + \lambda I)^{-1}X^{T}Y.
$$

I don't see why scale invariance fails to hold here.
Can anyone suggest a proof of it?

Best Answer

The intuition here is that there's a sleight-of-hand happening when you use the same symbol $X$ for both the original data and the rescaled data. It's misleading because the rescaling $\tilde{X}= XD$ is not the same as the original $X$, so we should make that explicit and write down how we're rescaling.

We can demonstrate this by considering two cases, first with the original units in $X$ and second the case where we use a rescaled matrix $\tilde{X}= XD$ where $D$ is a diagonal matrix that has all positive entries on the diagonal. If $X$ has shape $n \times p$ then $D$ has shape $p \times p$. (You can actually use any $D_{ii} \neq 0$ but "rescaling" is almost always meant to be restricted to multiplication by a positive scalar.)

In the first case, we have $$\beta(X) = (X^TX + \lambda I)^{-1}X^T y$$ which is just as written in the question.

In the second case, we apply the rescaling to $X$ and we have $$\begin{aligned} \beta(\tilde{X}) &= (\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T y\\ &= (DX^TXD + \lambda I)^{-1}DX^Ty \\ &= (D(X^TX + \lambda D^{-2})D)^{-1}DX^Ty \\ &= D^{-1}(X^TX + \lambda D^{-2})^{-1}X^Ty \end{aligned}$$

(remembering that $D$ is diagonal, so $D^T = D$).
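The last line of the derivation can be checked numerically. The following is a minimal NumPy sketch, not part of the original answer: the helper `ridge`, the random data, the value of `lam`, and the scale factors in `d` are all arbitrary choices made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data; sizes, seed, and penalty are assumptions.
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

d = np.array([0.5, 2.0, 10.0])  # positive diagonal entries of D
D = np.diag(d)

# Left-hand side: ridge fitted directly on the rescaled matrix X D.
beta_rescaled = ridge(X @ D, y, lam)

# Right-hand side: D^{-1} (X^T X + lam * D^{-2})^{-1} X^T y.
# Dividing elementwise by d is the same as multiplying by D^{-1}.
rhs = np.linalg.solve(X.T @ X + lam * np.diag(d ** -2.0), X.T @ y) / d

identity_holds = np.allclose(beta_rescaled, rhs)
print(identity_holds)
```

The two sides should agree up to floating-point error for any positive `d`, which is just the algebra above restated in code.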

From this we can conclude that, for $\lambda > 0$, the coefficients $\beta(X)$ and $\beta(\tilde{X})$ are in general the same only if $D=I$.

The final line shows that the rescaling has two effects on the coefficients.

  1. It has a multiplicative effect on the coefficients, just as we would intuitively expect based on what happens when we rescale in the OLS case.
  2. The change in scale is also "absorbed" into $\lambda$: the coefficient $\beta(\tilde{X})_i$ is effectively penalized by $\lambda / D_{ii}^2$, that is, inversely to the square of the rescaling $D_{ii}$. (Thanks to Firebug for this helpful suggestion.)
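To see what these two effects mean for predictions, here is a small sketch (again an illustration under assumed data, not something from the original answer) that compares fitted values before and after rescaling. With `lam = 0.0` the closed form reduces to OLS, where only the multiplicative effect is present and the fitted values are unchanged; with `lam > 0` the altered effective penalty changes the fit itself. The helper `ridge`, the simulated data, and the scale factors are hypothetical choices.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients; lam = 0.0 gives the OLS solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data; coefficients, noise level, and seed are assumptions.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=40)

d = np.array([1.0, 100.0])  # rescale only the second column
X_tilde = X * d             # equivalent to X @ np.diag(d)

gaps = {}
for lam in (0.0, 5.0):
    fit_orig = X @ ridge(X, y, lam)
    fit_resc = X_tilde @ ridge(X_tilde, y, lam)
    # Largest absolute difference between the two sets of fitted values.
    gaps[lam] = np.max(np.abs(fit_orig - fit_resc))
    print(lam, gaps[lam])
```

In the OLS case the gap should be at the level of rounding error, while for the ridge case it should be clearly nonzero. This is the practical reason predictors are commonly standardized before fitting a ridge model.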