Solved – Optimizing the ridge regression loss function with unpenalized intercept

gradient descent, machine learning, optimization, regression, ridge regression

I've been trying to optimize the ridge regression loss function by gradient descent, but I've been banging my head against it for a while. The loss function can be written as

$$L(w,b)=\sum_{i=1}^n(y^{(i)}-(w\cdot x^{(i)}+b))^2+\lambda\|w\|^2_2$$

with $w$ the vector of parameters and $b$ the intercept. Naively, I would simply differentiate with respect to $w$ and try to minimize that function:

$$\frac{\partial L}{\partial w}=-2\sum_{i=1}^n\left(y^{(i)}-(w\cdot x^{(i)}+b)\right)x^{(i)}+2\lambda w$$

Now, my initial idea would be to assimilate the intercept into the parameter vector in the loss function and differentiate with respect to my new $\tilde{w}=(b,w)$. But if I do so, wouldn't that also mean changing the regularization term, and thus adding a penalty on the intercept? It's my understanding that this should be avoided. How do I assimilate the intercept without also penalizing the intercept value in the regularization term?

Best Answer

I'd suggest assimilating via $v = (b, w)$ and doing $$ L(v) = \|y - Xv\|^2 + \lambda v^T \Omega v $$ where $$ \Omega = \text{diag}\left(0, 1, \dots, 1\right) $$ so $$ v^T\Omega v = \sum_{j \geq 2} v_j^2 = w^T w. $$
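In case it's useful, here is a rough NumPy sketch of this setup; the toy data and all names (`Z`, `y`, `lam`, and so on) are purely illustrative, not part of the original question:

```python
import numpy as np

# Toy data standing in for your features and targets (illustrative only).
rng = np.random.default_rng(0)
n, p = 50, 3
Z = rng.normal(size=(n, p))                      # features, without an intercept column
y = Z @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=n)
lam = 1.0                                        # ridge penalty strength

X = np.column_stack([np.ones(n), Z])             # prepend a column of ones, so v = (b, w)
Omega = np.eye(p + 1)
Omega[0, 0] = 0.0                                # zero entry: the intercept is unpenalized

def loss(v):
    """Penalized RSS: ||y - X v||^2 + lam * v^T Omega v."""
    r = y - X @ v
    return r @ r + lam * v @ Omega @ v
```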

Now we have $$ L(v) = y^Ty - 2v^TX^Ty + v^T(X^TX + \lambda \Omega)v $$ so $$ \nabla L = -2X^Ty + 2 (X^TX + \lambda \Omega)v $$ and then $$ \nabla L \stackrel{\text{set}}= 0 \implies (X^TX + \lambda \Omega)v = X^Ty. $$
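Continuing the sketch above (it reuses `X`, `y`, `Omega`, `lam`, and `p` from the previous snippet), solving that linear system directly and running plain gradient descent on the same objective, since gradient descent was the original goal, should land on the same $v$:

```python
# Closed-form solution of (X^T X + lam * Omega) v = X^T y.
v_closed = np.linalg.solve(X.T @ X + lam * Omega, X.T @ y)

# Gradient descent on the same objective, using the gradient derived above:
# grad L = -2 X^T y + 2 (X^T X + lam * Omega) v.
v = np.zeros(p + 1)
step = 1e-3                                      # small fixed step size, for illustration only
for _ in range(20_000):
    grad = -2 * X.T @ y + 2 * (X.T @ X + lam * Omega) @ v
    v -= step * grad

print(np.allclose(v, v_closed))                  # expect True: both recover the same (b, w)
```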


Regarding invertibility of $X^TX + \lambda \Omega$, let $z \in \mathbb R^p \setminus \{0\}$ and consider $$ z^T (X^TX + \lambda \Omega)z. $$ Let's try to make this $0$ to see whether $X^TX + \lambda \Omega$ is merely PSD (positive semidefinite) or actually PD (positive definite). Both $z^TX^TXz = \|Xz\|^2$ and $\lambda z^T\Omega z$ are nonnegative, so the whole quadratic form vanishes only if each term does. Since $z^T \Omega z = \sum_{j\geq 2}z_j^2$, the only way to get $\lambda z^T \Omega z = 0$ is to have $z \propto e_1$, the first standard basis vector, so WLOG take $z = e_1$. Then $$ e_1^TX^TXe_1 = \|Xe_1\|^2 = \|X_1\|^2 $$ where $X_1$ is the first column of $X$. So $X^TX + \lambda \Omega$ is properly PD, and not just PSD, exactly when the first column of $X$, which our choice of $\Omega$ made the unpenalized column, is "full rank", which here just means non-zero. That's easy to check (just look at your data), so we don't have to worry about non-invertibility even though we're not doing a full ridge regression.
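A quick numerical check of that argument, again reusing the arrays from the sketch above:

```python
# With a non-zero first column (here a column of ones), every eigenvalue is strictly positive.
print(np.linalg.eigvalsh(X.T @ X + lam * Omega).min() > 0)       # expect True

# Zeroing out the unpenalized column makes the matrix singular, as the argument predicts.
X_bad = X.copy()
X_bad[:, 0] = 0.0
print(np.linalg.eigvalsh(X_bad.T @ X_bad + lam * Omega).min())   # ~0, i.e. not invertible
```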


One slightly unsatisfactory part of this penalty is that it makes the Bayesian interpretation less clean: to recover this as a MAP estimate we'd need $$ v\sim \mathcal N\left(0, \frac{\sigma^2}{\lambda} \Omega^{-1}\right), $$ but $\Omega$ is singular, so $\Omega^{-1}$ doesn't exist and this prior has no proper density (the intercept direction effectively gets an improper flat prior). So if, for example, you wanted to choose $\lambda$ by maximizing the marginal likelihood as with a Gaussian process, you can't do that directly, or at least you won't get a Gaussian likelihood, since it isn't defined. One way to get this back is to instead view the problem as a mixed model $$ y = b\mathbf 1 + Zw + \varepsilon, $$ where $w\sim \mathcal N(0, \frac{\sigma^2}{\lambda} I)$ now plays the role of a random effect and $Z$ is $X$ without the first, intercept-giving column. The actual estimation procedure may change here depending on how much you want this to behave like a mixed model, but I think it's an interesting alternative. In particular, we can now integrate $w$ out and still end up with a valid Gaussian likelihood, and in the spirit of REML we could integrate $b$ out too by putting an improper uniform prior on it.
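To spell out the "integrate $w$ out" step as I read the mixed model above: marginalizing over $w\sim \mathcal N(0, \frac{\sigma^2}{\lambda} I)$ gives $$ y \mid b \sim \mathcal N\left(b\mathbf 1,\ \sigma^2\left(I + \tfrac{1}{\lambda} Z Z^T\right)\right), $$ which is a perfectly well-defined Gaussian likelihood, so $\lambda$ (and $\sigma^2$) can be chosen by maximizing it, much as one tunes kernel hyperparameters in a Gaussian process.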