Solved – Optimizing the ridge regression loss function with unpenalized intercept

gradient descent, machine learning, optimization, regression, ridge regression

I've been trying to optimize the ridge regression loss function by gradient descent, but I've been banging my head against it for a while. The loss function can be written as

$$L(w,b)=\sum_{i=1}^n(y^{(i)}-(w\cdot x^{(i)}+b))^2+\lambda\|w\|^2_2$$

with $w$ the vector of parameters and $b$ the intercept. Naively, I would simply differentiate with respect to $w$ and try to minimize that function:

$$\frac{\partial L}{\partial w}=-2\sum_{i=1}^n\left(y^{(i)}-(w\cdot x^{(i)}+b)\right)x^{(i)}+2\lambda w$$

Now, my initial idea would be to assimilate the intercept into the parameter vector in the loss function and differentiate with respect to my new $\tilde{w}=(b,w)$. But if I do so, wouldn't that also mean changing the regularization term, and thus adding a penalty on the intercept? It's my understanding that this should be avoided. How do I assimilate the intercept without also penalizing the intercept value in the regularization term?

Best Answer

I'd suggest assimilating via $v = (b, w)$ and doing $$ L(v) = \|y - Xv\|^2 + \lambda v^T \Omega v $$ where $$ \Omega = \text{diag}\left(0, 1, \dots, 1\right) $$ so $$ v^T\Omega v = \sum_{j \geq 2} v_j^2 = w^T w. $$
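In case it's useful, here is a rough NumPy sketch of this setup; the toy data and all names (`Z`, `y`, `lam`, and so on) are purely illustrative, not part of the original question:

```python
import numpy as np

# Toy data standing in for your features and targets (illustrative only).
rng = np.random.default_rng(0)
n, p = 50, 3
Z = rng.normal(size=(n, p))                      # features, without an intercept column
y = Z @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=n)
lam = 1.0                                        # ridge penalty strength

X = np.column_stack([np.ones(n), Z])             # prepend a column of ones, so v = (b, w)
Omega = np.eye(p + 1)
Omega[0, 0] = 0.0                                # zero entry: the intercept is unpenalized

def loss(v):
    """Penalized RSS: ||y - X v||^2 + lam * v^T Omega v."""
    r = y - X @ v
    return r @ r + lam * v @ Omega @ v
```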

Now we have $$ L(v) = y^Ty - 2v^TX^Ty + v^T(X^TX + \lambda \Omega)v $$ so $$ \nabla L = -2X^Ty + 2 (X^TX + \lambda \Omega)v $$ and then $$ \nabla L \stackrel{\text{set}}= 0 \implies (X^TX + \lambda \Omega)v = X^Ty. $$
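Continuing the sketch above (it reuses `X`, `y`, `Omega`, `lam`, and `p` from the previous snippet), solving that linear system directly and running plain gradient descent on the same objective, since gradient descent was the original goal, should land on the same $v$:

```python
# Closed-form solution of (X^T X + lam * Omega) v = X^T y.
v_closed = np.linalg.solve(X.T @ X + lam * Omega, X.T @ y)

# Gradient descent on the same objective, using the gradient derived above:
# grad L = -2 X^T y + 2 (X^T X + lam * Omega) v.
v = np.zeros(p + 1)
step = 1e-3                                      # small fixed step size, for illustration only
for _ in range(20_000):
    grad = -2 * X.T @ y + 2 * (X.T @ X + lam * Omega) @ v
    v -= step * grad

print(np.allclose(v, v_closed))                  # expect True: both recover the same (b, w)
```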


Regarding invertibility of $X^TX + \lambda \Omega$, let $z \in \mathbb R^p \setminus \{0\}$ and consider $$ z^T (X^TX + \lambda \Omega)z. $$ Let's try to make this $0$ to see whether $X^TX + \lambda \Omega$ is merely PSD (positive semidefinite) or actually PD (positive definite). Both $z^TX^TXz = \|Xz\|^2$ and $\lambda z^T\Omega z$ are nonnegative, so the whole quadratic form vanishes only if each term does. Since $z^T \Omega z = \sum_{j\geq 2}z_j^2$, the only way to get $\lambda z^T \Omega z = 0$ is to have $z \propto e_1$, the first standard basis vector, so WLOG take $z = e_1$. Then $$ e_1^TX^TXe_1 = \|Xe_1\|^2 = \|X_1\|^2 $$ where $X_1$ is the first column of $X$. So $X^TX + \lambda \Omega$ is properly PD, and not just PSD, exactly when the first column of $X$, which our choice of $\Omega$ made the unpenalized column, is "full rank", which here just means non-zero. That's easy to check (just look at your data), so we don't have to worry about non-invertibility even though we're not doing a full ridge regression.
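A quick numerical check of that argument, again reusing the arrays from the sketch above:

```python
# With a non-zero first column (here a column of ones), every eigenvalue is strictly positive.
print(np.linalg.eigvalsh(X.T @ X + lam * Omega).min() > 0)       # expect True

# Zeroing out the unpenalized column makes the matrix singular, as the argument predicts.
X_bad = X.copy()
X_bad[:, 0] = 0.0
print(np.linalg.eigvalsh(X_bad.T @ X_bad + lam * Omega).min())   # ~0, i.e. not invertible
```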


One slightly unsatisfactory part of this penalty is that it makes the Bayesian interpretation less clean: to recover this as a MAP estimate we'd need $$ v\sim \mathcal N\left(0, \frac{\sigma^2}{\lambda} \Omega^{-1}\right), $$ but $\Omega$ is singular, so $\Omega^{-1}$ doesn't exist and this prior has no proper density (the intercept direction effectively gets an improper flat prior). So if, for example, you wanted to choose $\lambda$ by maximizing the marginal likelihood as with a Gaussian process, you can't do that directly, or at least you won't get a Gaussian likelihood, since it isn't defined. One way to get this back is to instead view the problem as a mixed model $$ y = b\mathbf 1 + Zw + \varepsilon, $$ where $w\sim \mathcal N(0, \frac{\sigma^2}{\lambda} I)$ now plays the role of a random effect and $Z$ is $X$ without the first, intercept-giving column. The actual estimation procedure may change here depending on how much you want this to behave like a mixed model, but I think it's an interesting alternative. In particular, we can now integrate $w$ out and still end up with a valid Gaussian likelihood, and in the spirit of REML we could integrate $b$ out too by putting an improper uniform prior on it.
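To spell out the "integrate $w$ out" step as I read the mixed model above: marginalizing over $w\sim \mathcal N(0, \frac{\sigma^2}{\lambda} I)$ gives $$ y \mid b \sim \mathcal N\left(b\mathbf 1,\ \sigma^2\left(I + \tfrac{1}{\lambda} Z Z^T\right)\right), $$ which is a perfectly well-defined Gaussian likelihood, so $\lambda$ (and $\sigma^2$) can be chosen by maximizing it, much as one tunes kernel hyperparameters in a Gaussian process.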