Solved – Ridge Regression – Increase in $\lambda$ leads to a decrease in flexibility


In An Introduction to Statistical Learning, in the section where ridge regression is explained, the authors say:

As $\lambda$ increases, the flexibility of the ridge regression fit
decreases, leading to decreased variance but increased bias.

Here is my take on proving this line:
In ridge regression we have to minimize the sum
$$RSS+\lambda\sum_{j=1}^p\beta_j^2=\sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big)^2+\lambda\sum_{j=1}^p\beta_j^2.$$
Here we can see that a general increase in the magnitudes of the entries of the $\beta$ vector will decrease $RSS$ but increase the penalty term. So, in order to minimize the whole expression, a kind of equilibrium must be struck between the $RSS$ term and the $\lambda\sum_{j=1}^p\beta_j^2$ term. Let their sum be $S$.
Now, if we increase $\lambda$ by $1$, then at the previous value of the $\beta$ vector the term $\lambda\sum_{j=1}^p\beta_j^2$ increases while $RSS$ stays the same, so $S$ increases. To attain a new equilibrium, the coefficients $\beta_j$ must decrease in magnitude.$^{[1]}$

Therefore as a general trend, we can say that if we increase the value
of $\lambda$ then the magnitude of the coefficients decreases.

Now, if the coefficients of the predictors decrease, then their contribution to the model decreases; that is, their effect decreases, and thus the flexibility of the model should decrease.
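For what it's worth, the claimed trend is easy to check numerically. The following sketch (my own illustration, not part of the book) computes the closed-form ridge solution $(X^TX+\lambda I)^{-1}X^Ty$ on simulated data, omitting the intercept for simplicity, and prints the norm of the coefficient vector as $\lambda$ grows:

```python
# Minimal numerical check: ridge coefficient magnitudes shrink as lambda grows.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5, 0.5, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    # Closed-form ridge estimate (no intercept, predictors left unstandardized).
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:6.0f}   ||beta_hat|| = {np.linalg.norm(beta_hat):.4f}")
```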


This proof appears appealing, but I have a gut feeling that there are gaps here and there. If it is correct, good; but if it isn't, I would like to know where it fails and, ideally, to see a correct version of it.


$^{[1]}$: I can attach a plausible explanation on this point, if needed.

Best Answer

Let's ignore the penalty term for a moment, while we explore the sensitivity of the solution to changes in a single observation. This has ramifications for all linear least-squares models, not just Ridge regression.

Notation

To simplify the notation, let $X$ be the model matrix, including a column of constant values (and therefore having $p+1$ columns indexed from $0$ through $p$), let $y$ be the response $n$-vector, and let $\beta=(\beta_0, \beta_1, \ldots, \beta_p)$ be the $p+1$-vector of coefficients. Write $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ for observation $i$. The unpenalized objective is the (squared) $L_2$ norm of the difference,

$$RSS(\beta)=||y - X\beta||^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2.\tag{1}$$

Without any loss of generality, order the observations so the one in question is the last. Let $k$ be the index of any one of the variables ($0 \le k \le p$).

Analysis

The aim is to expose the essential simplicity of this situation by focusing on how the sum of squares $RSS$ depends on $x_{nk}$ and $\beta_k$--nothing else matters. To this end, split $RSS$ into the contributions from the first $n-1$ observations and the last one:

$$RSS(\beta) = (y_n - \mathbf{x}_n\beta)^2 + \sum_{i=1}^{n-1} (y_i - \mathbf{x}_i\beta)^2.$$

Both terms are quadratic functions of $\beta_k$. Considering all the other $\beta_j,$ $j\ne k$, as constants for the moment, this means the objective can be written in the form

$$RSS(\beta_k) = (x_{nk}^2 \beta_k^2 + E\beta_kx_{nk} + F) + (A^2\beta_k^2 + B\beta_k + C).$$

The new quantities $A\cdots F$ do not depend on $\beta_k$ or $x_{nk}$. Combining the terms and completing the square gives something in the form

$$RSS(\beta_k) = \left(\beta_k\sqrt{x_{nk}^2 + A^2} + \frac{Ex_{nk}+B}{2\sqrt{x_{nk}^2+A^2}} \right)^2 + G - \frac{(Ex_{nk}+B)^2}{4(x_{nk}^2+A^2)}\tag{2}$$

where the quantity $G$ does not depend on $x_{nk}$ or $\beta_k$.
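For concreteness (the original derivation leaves this implicit), write $r_i = y_i - \sum_{j\ne k}\beta_jx_{ij}$ for the partial residuals with the $\beta_k$ term removed. Expanding the two quadratics and matching coefficients gives one consistent identification:

$$A^2=\sum_{i=1}^{n-1}x_{ik}^2,\quad B=-2\sum_{i=1}^{n-1}r_ix_{ik},\quad C=\sum_{i=1}^{n-1}r_i^2,\quad E=-2r_n,\quad F=r_n^2,\quad G=C+F.$$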

Estimating sensitivity

We may readily estimate the sizes of the coefficients in $(2)$ when $|x_{nk}|$ grows large compared to $|A|$. When that is the case,

$$RSS(\beta_k) \approx \left(\beta_k x_{nk} + E/2\right)^2 + G-E^2/4.$$

This makes it easy to see what changing $|x_{nk}|$ must do to the optimum $\hat\beta_k$. For sufficiently large $|x_{nk}|$, $\hat\beta_k$ will be approximately inversely proportional to $x_{nk}$.
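To spell out the step (left implicit above), the squared term in $(2)$ vanishes at

$$\hat\beta_k = -\frac{Ex_{nk}+B}{2\left(x_{nk}^2+A^2\right)},$$

which for $|x_{nk}|\gg|A|$ behaves like $-E/(2x_{nk})$ when $E\ne 0$, and like $-B/(2x_{nk}^2)$ when $E=0$.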

We have actually learned, and proven, much more than was requested, because Ridge regression can be formulated as model $(1)$. Specifically, to the original $n$ observations you adjoin $p+1$ fake observations of the form $\mathbf{x}_{n+i} = (0,0,\ldots, 0,1,0,\ldots,0)$ (a single $1$, in a different coordinate for each fake observation), each with response $0$, and you multiply each of them by $\sqrt{\lambda}$, so that together they contribute exactly the penalty $\lambda\sum_j\beta_j^2$ to the sum of squares. For these fake observations the linear coefficient $E$ vanishes (their responses and their remaining entries are zero), so the preceding analysis shows that for $\lambda$ sufficiently large (and "sufficiently" can be computed in terms of $|A|$, which is a function of the actual data only), every one of the $\hat\beta_k$ will be approximately inversely proportional to $\lambda$.
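Here is a small numerical check of that formulation (a sketch of my own; the intercept column is omitted for brevity). Appending the fake rows $\sqrt{\lambda}\,I$ with zero responses and running ordinary least squares reproduces the closed-form ridge solution, and $\lambda\hat\beta$ settles down to a fixed vector as $\lambda$ grows, i.e. each coefficient is asymptotically proportional to $1/\lambda$:

```python
# Ridge regression via data augmentation: OLS on the augmented system
# equals the closed-form ridge solution, and beta_hat ~ 1/lambda for large lambda.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

for lam in [1.0, 10.0, 100.0, 1000.0, 10000.0]:
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # adjoin the fake observations
    y_aug = np.concatenate([y, np.zeros(p)])            # ... with zero responses
    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    assert np.allclose(beta_aug, beta_ridge)            # the two formulations agree
    print(f"lambda = {lam:8.0f}   lambda * beta_hat = {np.round(lam * beta_aug, 3)}")
```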


An analysis that requires some more sophisticated results from Linear Algebra appears at The proof of shrinking coefficients using ridge regression through "spectral decomposition". It does add one insight: the coefficients in the asymptotic relationships $\hat\beta_k \sim 1/\lambda$ will be the reciprocal nonzero singular values of $X$.