In The Elements of Statistical Learning, Chapter 3, it is stated that linear regression is scale-invariant, since the scaling matrix applied to the coefficients eventually cancels, but that Ridge regression is not. Since the Ridge coefficient has the closed form
$$
\beta = (X^{T}X + \lambda I)^{-1}X^{T}Y,
$$
I don't see why scale-invariance fails here. Can anyone suggest a proof?
Ridge Regression – Why the Ridge Regression Is Not Scale-Invariant?
Best Answer
The intuition here is that there's a sleight of hand happening when you use the same symbol $X$ for both the original data and the rescaled data. It's misleading because the rescaled matrix $\tilde{X} = XD$ is not the same as the original $X$, so we should make the rescaling explicit and write down exactly how it enters the formula.
We can demonstrate this by considering two cases, first with the original units in $X$ and second the case where we use a rescaled matrix $\tilde{X}= XD$ where $D$ is a diagonal matrix that has all positive entries on the diagonal. If $X$ has shape $n \times p$ then $D$ has shape $p \times p$. (You can actually use any $D_{ii} \neq 0$ but "rescaling" is almost always meant to be restricted to multiplication by a positive scalar.)
In the first case, we have $$\beta(X) = (X^TX + \lambda I)^{-1}X^T y$$ which is just as written in the question.
In the second case, we apply the rescaling to $X$ and we have $$\begin{aligned} \beta(\tilde{X}) &= (\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T y\\ &= (DX^TXD + \lambda I)^{-1}D X^Ty \\ &= (D(X^\top X + \lambda D^{-2})D)^{-1}DX^Ty \\ &= D^{-1}(X^T X + \lambda D^{-2})^{-1}X^Ty \end{aligned}$$
(remembering that $D$ is diagonal, so $D^T = D$).
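As a sanity check on this algebra, here is a quick numeric sketch (the data and the choice of $D$ are random and purely illustrative) verifying that the direct Ridge solution on the rescaled data matches the final line of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
D = np.diag([0.5, 2.0, 10.0])  # positive diagonal rescaling of each column
Xt = X @ D                     # the rescaled design matrix, X-tilde

# Ridge coefficients computed directly on the rescaled data ...
beta_direct = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ y)

# ... match the derived form D^{-1} (X^T X + lam D^{-2})^{-1} X^T y
Dinv = np.linalg.inv(D)
beta_derived = Dinv @ np.linalg.solve(X.T @ X + lam * Dinv @ Dinv, X.T @ y)

assert np.allclose(beta_direct, beta_derived)
```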
From this we can conclude that $\beta(\tilde{X})$ reduces to the simple rescaling $D^{-1}\beta(X)$ only if $D = I$; for any other rescaling, the penalty term inside the inverse is distorted from $\lambda I$ to $\lambda D^{-2}$.
The final line shows that the rescaling has two effects on the coefficients: the outer factor $D^{-1}$, which is exactly the harmless equivariance you also get with ordinary least squares, and the change of the penalty from $\lambda I$ to $\lambda D^{-2}$, which effectively penalizes each column by a different amount. The second effect is what breaks scale-invariance: the fitted values $\tilde{X}\beta(\tilde{X})$ no longer equal $X\beta(X)$.
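The practical consequence can be seen numerically (again a minimal sketch with made-up data): rescaling the columns leaves the OLS fitted values unchanged, but changes the Ridge fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
D = np.diag([0.5, 2.0, 10.0])  # positive diagonal rescaling
Xt = X @ D

def ridge(X, y, lam):
    """Closed-form (X^T X + lam I)^{-1} X^T y; lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS (lam = 0): fitted values are identical after rescaling
yhat_ols  = X  @ ridge(X,  y, 0.0)
yhat_ols2 = Xt @ ridge(Xt, y, 0.0)
assert np.allclose(yhat_ols, yhat_ols2)

# Ridge (lam > 0): fitted values change under rescaling
yhat_ridge  = X  @ ridge(X,  y, lam)
yhat_ridge2 = Xt @ ridge(Xt, y, lam)
assert not np.allclose(yhat_ridge, yhat_ridge2)
```

This is also why, in practice, predictors are usually standardized before fitting Ridge regression: it makes the single penalty $\lambda$ act comparably on every coefficient.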