Solved – Rationale behind shrinking regression coefficients in Ridge or LASSO regression

lasso, regression, ridge regression

I understand that with Ridge or Lasso regression we try to shrink the regression coefficients, and that we control the amount of shrinkage by varying the penalty parameter (alpha). But I cannot understand the intuition or rationale behind doing this, since we no longer fit the best line to the data.

Best Answer

Here is the general intuition behind shrinking coefficients in linear regression, borrowing figures and equations from Pattern Recognition and Machine Learning by Bishop.

Imagine that you have to approximate the function $y = \sin(2\pi x)$ from $N$ noisy observations. You can do this using linear regression, which fits the $M$-degree polynomial

$$ y(x, \textbf{w}) = \sum_{j=0}^{M}{w_j x^j} $$ by minimizing the error function,

$$ E( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} $$
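For concreteness, here is a minimal NumPy sketch of this setup (not from Bishop's book): it fits an $M$-degree polynomial to $N$ noisy samples of $\sin(2\pi x)$ by minimizing $E(\textbf{w})$ through ordinary least squares. The sample size, noise level, and random seed are arbitrary choices made for illustration.

```python
# Sketch: fit y(x, w) = sum_j w_j x^j to N noisy samples of sin(2*pi*x)
# by minimizing the sum-of-squares error E(w) above.
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

def fit_polynomial(x, t, M):
    """Return least-squares weights w_0..w_M for an M-degree polynomial."""
    X = np.vander(x, M + 1, increasing=True)     # columns are x^0 .. x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)    # minimizes ||Xw - t||^2
    return w

for M in (0, 1, 3, 9):
    w = fit_polynomial(x, t, M)
    residual = np.vander(x, M + 1, increasing=True) @ w - t
    print(f"M={M}: training error = {0.5 * np.sum(residual**2):.4f}")
```

As expected, the training error shrinks as $M$ grows, reaching (essentially) zero at $M = 9$, which is exactly the over-fitting issue discussed next.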

By choosing different values of $M$, one can fit polynomials of varying complexity. Here are some example fits (red lines) and the corresponding values of $M$. The blue dots represent the observations, and the green line is the true underlying function. The goal is to fit a polynomial that closely approximates this underlying function.

[Figure: linear regression fits (red) vs. $M$, with observations (blue dots) and the true function (green)]

Watch what happens with the high-degree polynomial, $M=9$. This polynomial gives the minimum training error, since it passes through all the points. But it is not a good fit, because the model is fitting the noise in the data rather than the underlying function. Since the overall goal of linear regression is to predict $t$ for unseen values of $x$, the high-degree polynomial will serve you poorly!

Further, let's take a look at the values of the regression coefficients. Notice how the coefficient values explode for the higher-degree polynomial.

[Figure: linear regression coefficient values for different $M$]
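Continuing the sketch above (same assumed data), printing the largest fitted weight for a few values of $M$ makes this explosion visible; the exact numbers depend on the particular noise realisation.

```python
# Continuation of the earlier sketch: show how the coefficient magnitudes
# grow with the polynomial degree M.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for M in (1, 3, 9):
    X = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    print(f"M={M}: max |w_j| = {np.abs(w).max():.1f}")
```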

The solution to this problem is regularization, where the error function to be minimized is redefined as follows:

$$ E'( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} + \frac{\lambda}{2} \Vert \textbf{w} \Vert^2 $$

This gives us the formulation of Ridge regression. The penalty term, the squared $L_2$ norm of the weights, discourages the regression coefficients from reaching large values, thus preventing over-fitting.
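As a rough illustration (again not from the book), the minimizer of $E'(\textbf{w})$ has the closed form $\textbf{w} = (\lambda I + X^T X)^{-1} X^T \textbf{t}$, where $X$ is the matrix of polynomial features. The sketch below compares the unpenalized and ridge solutions for $M = 9$ on the same assumed data; the value of $\lambda$ is arbitrary, not a tuned choice.

```python
# Sketch of ridge regression on the M = 9 polynomial features:
# the closed-form minimizer of E'(w) is w = (lambda*I + X^T X)^{-1} X^T t.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M, lam = 9, 1e-3                                   # lambda is illustrative only
X = np.vander(x, M + 1, increasing=True)

w_ols = np.linalg.lstsq(X, t, rcond=None)[0]                        # lambda = 0
w_ridge = np.linalg.solve(lam * np.eye(M + 1) + X.T @ X, X.T @ t)   # ridge

print("max |w_j| without penalty:", np.abs(w_ols).max())
print("max |w_j| with ridge penalty:", np.abs(w_ridge).max())
```

Even a small $\lambda$ pulls the coefficients down by orders of magnitude, which is exactly the shrinkage the question asks about.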

Lasso has a similar formulation, where the penalty term is the $L_1$ norm of the weights. Including the lasso penalty has an interesting effect -- it drives some of the coefficients exactly to zero, giving a sparse solution.

$$ E''( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} + \frac{\lambda}{2} \Vert \textbf{w} \Vert_1 $$
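The lasso has no closed-form solution, so a sketch of it here leans on scikit-learn's `Lasso` estimator. Note that scikit-learn scales the squared-error term by $1/(2N)$, so its `alpha` corresponds to $\lambda$ only up to that constant; the value used below is purely illustrative.

```python
# Sketch of the lasso on the same degree-9 polynomial features,
# using scikit-learn. Some coefficients are typically driven exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

X = np.vander(x, 10, increasing=True)[:, 1:]   # columns x^1 .. x^9 (intercept fit separately)

model = Lasso(alpha=1e-3, max_iter=100_000).fit(X, t)
print("fitted coefficients:", model.coef_)
print("coefficients driven exactly to zero:", int(np.sum(model.coef_ == 0)))
```

The zeros in `coef_` are what make the lasso useful for variable selection as well as shrinkage.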