Machine Learning – Understanding Ridge Regression Subtlety on Intercept

machine-learning, ridge-regression

I just noticed that when using ridge regression, there is a small subtlety in the penalised parameters, namely, we don't penalise $\theta_0$. Can someone give me a simple and intuitive explanation of why it's important to keep the intercept out of the regularization component?

I assume the following optimization expression:

$$\hat{\theta}_{\textrm{ridge}} = \underset{\theta}{\operatorname{argmin}} \quad \sum_{1 \leq i \leq n} (y_i - f(x_i))^2 + \lambda \sum_{1 \leq i \leq d} \theta_i^2 $$

where $n$ is the number of data points in our dataset and $d+1$ the number of parameters $(\theta_0, \ldots, \theta_d)$. Note also that the intercept is implicitly included in my $f(x_i)$ function by prepending a $1$ to each data point, i.e. $x_i := [1 \quad x_i]^T$, so that $\theta_0$ is picked up.
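For reference, here is a minimal numerical sketch of this objective (toy, made-up data; assuming the linear model $f(x_i) = \theta^T x_i$ and the standard closed-form ridge solution), where $\theta_0$ is excluded from the penalty by zeroing its entry in the penalty matrix:

```python
import numpy as np

# Toy data (hypothetical): n = 5 points, d = 1 feature, plus a leading 1 for the intercept
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.0],
              [1.0, 3.5],
              [1.0, 4.0]])
y = np.array([1.1, 2.0, 2.4, 3.9, 4.2])

lam = 10.0

# Penalty matrix: identity, except the entry for theta_0 is zero,
# so the intercept does not appear in the regularization term
P = np.eye(X.shape[1])
P[0, 0] = 0.0

# Closed-form solution of the penalized objective: (X^T X + lambda * P)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(theta_ridge)  # theta_0 is left unshrunk; theta_1 is shrunk towards 0
```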

Thanks!

Best Answer

I will give you a non-rigorous but intuitive reason why the intercept is not penalized. When we estimate a penalized model, we usually scale and center the predictors. Because the centered predictors sum to zero, the (unpenalized) intercept is then estimated to be exactly the mean of the outcome variable.
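A quick numerical illustration of this point (a minimal sketch with randomly generated toy data; the intercept is kept out of the penalty by zeroing its entry in the penalty matrix): with a centered predictor, the fitted intercept equals $\bar{y}$ exactly, regardless of $\lambda$.

```python
import numpy as np

# Toy, randomly generated data (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Center (and scale) the predictor, then add a column of ones for the intercept
x_c = (x - x.mean()) / x.std()
X = np.column_stack([np.ones_like(x_c), x_c])

lam = 100.0
P = np.diag([0.0, 1.0])  # no penalty on the intercept, ridge penalty on the slope

theta = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(theta[0], y.mean())  # the two values coincide: the intercept is the sample mean of y
```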

Note that the mean of the outcome variable is the simplest prediction we could make (aside from predicting a random number unrelated to the outcome, in which case why use data at all, right?). Aside from its simplicity, the sample mean is also the minimizer of squared loss when we don't consider any other variables.
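To spell out that last claim: if we must predict a single constant $c$ for every observation, the squared loss is minimized by the sample mean,

$$\frac{d}{dc} \sum_{1 \leq i \leq n} (y_i - c)^2 = -2 \sum_{1 \leq i \leq n} (y_i - c) = 0 \quad \Longrightarrow \quad c = \frac{1}{n} \sum_{1 \leq i \leq n} y_i = \bar{y}.$$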

Penalizing the intercept would bias the model's predictions away from the sample mean: in the extreme case where all parameters are shrunk towards 0, the predictions are dragged towards 0 rather than towards $\bar{y}$. This would result in poorer predictions than we could otherwise make; put another way, we could reduce the squared error further simply by leaving the intercept unshrunk.
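A rough illustration of that extreme case (same kind of toy, randomly generated data as above, with an intentionally huge $\lambda$): penalizing the intercept drags the prediction towards $0$, while leaving it unpenalized keeps the prediction at the sample mean.

```python
import numpy as np

# Toy data (hypothetical) whose outcome mean is far from zero
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 10.0 + 0.5 * x + rng.normal(scale=0.5, size=50)

x_c = (x - x.mean()) / x.std()
X = np.column_stack([np.ones_like(x_c), x_c])
lam = 1e6  # extreme shrinkage

# Ridge with the intercept left unpenalized vs. ridge that also penalizes the intercept
theta_free = np.linalg.solve(X.T @ X + lam * np.diag([0.0, 1.0]), X.T @ y)
theta_pen  = np.linalg.solve(X.T @ X + lam * np.eye(2),           X.T @ y)

print(theta_free[0], y.mean())  # intercept stays at the sample mean (about 10)
print(theta_pen[0])             # intercept is shrunk essentially to 0, far below the mean
```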