Are there special cases where ridge regression can also lead to coefficients that are exactly zero?
It is widely known that the lasso shrinks coefficients towards, and onto, zero, while ridge regression cannot shrink coefficients all the way to zero.
Solved – any special case where ridge regression can shrink coefficients to zero
lasso, machine learning, ridge regression
Related Solutions
This is regarding the variance.
OLS provides what is called the Best Linear Unbiased Estimator (BLUE). That means that any other linear unbiased estimator is bound to have a higher variance than the OLS solution. So why on earth should we consider anything other than that?
Now the trick with regularization, such as the lasso or ridge, is to add some bias in order to reduce the variance. When you estimate your prediction error, it is a combination of three things: $$ \text{E}[(y-\hat{f}(x))^2]=\text{Bias}[\hat{f}(x)]^2 +\text{Var}[\hat{f}(x)]+\sigma^2 $$ The last part is the irreducible error, so we have no control over that. With the OLS solution the bias term is zero. But it might be that the second term is large. If we want good predictions, it might be a good idea to add in some bias and hopefully reduce the variance.
So what is this $\text{Var}[\hat{f}(x)]$? It is the variance introduced by the estimates of the parameters in your model. The linear model has the form $$ \mathbf{y}=\mathbf{X}\beta + \epsilon,\qquad \epsilon\sim\mathcal{N}(0,\sigma^2I) $$ To obtain the OLS solution we solve the minimization problem $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2 $$ This provides the solution $$ \hat{\beta}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$ The minimization problem for ridge regression is similar: $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2+\lambda||\beta||^2\qquad \lambda>0 $$ Now the solution becomes $$ \hat{\beta}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}\mathbf{X}^T\mathbf{y} $$ So we are adding this $\lambda I$ (called the ridge) on the diagonal of the matrix that we invert. The effect this has on the matrix $\mathbf{X}^T\mathbf{X}$ is that it "pulls" its eigenvalues (and hence its determinant) away from zero. Thus when you invert it, you do not get huge eigenvalues, and that leads to another interesting fact: the variance of the parameter estimates becomes lower.
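To make the two closed-form solutions concrete, here is a small R sketch that computes both estimates directly from the formulas above on simulated data (the design, the true coefficients and $\lambda=1$ are arbitrary choices for illustration; the intercept and standardization that packages such as glmnet handle for you are ignored here):
# Simulate a small regression problem
set.seed(42)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- c(2, -1, 0.5, 0, 0)
y <- X %*% beta_true + rnorm(n)
# OLS: solve (X'X) b = X'y
beta_ols <- solve(t(X) %*% X, t(X) %*% y)
# Ridge: solve (X'X + lambda * I) b = X'y
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(beta_true, beta_ols, beta_ridge)
The ridge column is typically pulled towards zero relative to the OLS column, but none of its entries is exactly zero.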
I am not sure I can provide a clearer answer than this. What it all boils down to is the covariance matrix of the parameter estimates and the magnitude of its entries.
I took ridge regression as an example because it is much easier to treat. The lasso is much harder, and there is still active, ongoing research on that topic.
These slides provide some more information and this blog also has some relevant information.
EDIT: What do I mean by saying that adding the ridge "pulls" the determinant away from zero?
Note that the matrix $\mathbf{X}^T\mathbf{X}$ is a symmetric positive semidefinite matrix (and positive definite when $\mathbf{X}$ has full column rank). All symmetric matrices with real entries have real eigenvalues, and positive semidefiniteness means the eigenvalues are all greater than or equal to zero (strictly greater than zero in the full-rank case).
OK, so how do we calculate the eigenvalues? We solve the characteristic equation: $$ \text{det}(\mathbf{X}^T\mathbf{X}-tI)=0 $$ This is a polynomial in $t$, and as stated above, its roots (the eigenvalues) are real and non-negative. Now let's take a look at the corresponding equation for the ridge matrix we need to invert: $$ \text{det}(\mathbf{X}^T\mathbf{X}+\lambda I-tI)=0 $$ We can rearrange this a little bit and see: $$ \text{det}(\mathbf{X}^T\mathbf{X}-(t-\lambda)I)=0 $$ So we can solve this for $(t-\lambda)$ and get the same eigenvalues as in the first problem. If one of those eigenvalues is $t_i$, then the corresponding eigenvalue of the ridge matrix is $t_i+\lambda$: every eigenvalue gets shifted up by $\lambda$, so they all move away from zero.
Here is some R code to illustrate this (the matrix B below is just an arbitrary symmetric matrix, not necessarily positive definite, but the shift works the same way):
# Create a random 3x3 matrix
A <- matrix(sample(10, 9, replace = TRUE), nrow = 3, ncol = 3)
# Make it symmetric
B <- A + t(A)
# Calculate its eigenvalues
eigen(B)
# Calculate the eigenvalues of B with a ridge of 3 added to the diagonal
eigen(B + 3 * diag(3))
Which gives the results:
> eigen(B)
$values
[1] 37.368634 6.952718 -8.321352
> eigen(B+3*diag(3))
$values
[1] 40.368634 9.952718 -5.321352
So all the eigenvalues get shifted up by exactly 3.
You can also prove this kind of statement in general by using the Gershgorin circle theorem: the centers of the circles containing the eigenvalues are the diagonal elements, so you can always add "enough" to the diagonal elements to push all the circles into the positive real half-plane. That result is more general than we need here, though.
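As a quick sanity check of the Gershgorin bound, here is a sketch that reuses the matrix B from the snippet above (for a real symmetric matrix the discs reduce to intervals on the real line):
# Gershgorin discs: centers are the diagonal entries, radii are the off-diagonal absolute row sums
centers <- diag(B)
radii   <- rowSums(abs(B)) - abs(diag(B))
# every eigenvalue should lie in at least one interval [center - radius, center + radius]
sapply(eigen(B)$values,
       function(ev) any(ev >= centers - radii & ev <= centers + radii))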
Roughly speaking, there are three different sources of prediction error:
- the bias of your model
- the variance of your model
- unexplainable variance
We can't do anything about point 3 (except for attempting to estimate the unexplained variance and incorporating it in our predictive densities and prediction intervals). This leaves us with 1 and 2.
If you actually have the "right" model, then, say, OLS parameter estimates will be unbiased and have minimal variance among all unbiased (linear) estimators (they are BLUE). Predictions from an OLS model will be best linear unbiased predictions (BLUPs). That sounds good.
However, it turns out that although we have unbiased predictions and minimal variance among all unbiased predictions, the variance can still be pretty large. More importantly, we can sometimes introduce "a little" bias and simultaneously save "a lot" of variance, and by getting the tradeoff just right we can get a lower prediction error with a biased (lower-variance) model than with an unbiased (higher-variance) one. This is called the "bias-variance tradeoff", and this question and its answers are enlightening: When is a biased estimator preferable to unbiased one?
Regularization methods like the lasso, ridge regression, the elastic net and so forth do exactly that: they pull the model towards zero. (Bayesian approaches are similar; they pull the model towards the priors.) Thus, regularized models will be biased compared to non-regularized models, but they will also have lower variance. If you choose your regularization right, the result is a prediction with lower error.
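A small simulation makes this tangible. The sketch below (with an arbitrary fixed design, true coefficients all equal to 1, and $\lambda=5$) refits OLS and ridge on many simulated response vectors and compares the average and the spread of the estimates of the first coefficient:
set.seed(1)
n <- 50; p <- 10; lambda <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- rep(1, p)
one_fit <- function() {
  y <- X %*% beta_true + rnorm(n)
  ols   <- solve(t(X) %*% X,                    t(X) %*% y)
  ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
  c(ols = ols[1], ridge = ridge[1])
}
est <- replicate(2000, one_fit())
rowMeans(est)       # the ridge mean is typically pulled below the true value 1 (bias)
apply(est, 1, var)  # ...but the ridge variance is typically smaller than the OLS variance
The exact numbers depend on the simulated design, but the pattern (a little bias bought in exchange for less variance) is exactly the tradeoff described above.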
If you search for "bias-variance tradeoff regularization" or similar, you get some food for thought. This presentation, for instance, is useful.
EDIT: amoeba quite rightly points out that I am handwaving as to why exactly regularization yields lower variance of models and predictions. Consider a lasso model with a large regularization parameter $\lambda$. If $\lambda\to\infty$, your lasso parameter estimates will all be shrunk to zero. A fixed parameter value of zero has zero variance. (This is not entirely correct, since the threshold value of $\lambda$ beyond which your parameters will be shrunk to zero depends on your data and your model. But given the model and the data, you can find a $\lambda$ such that the model is the zero model. Always keep your quantifiers straight.) However, the zero model will of course also have a giant bias. It doesn't care about the actual observations, after all.
And the same applies to not-all-that-extreme values of your regularization parameter(s): small values will yield estimates close to the unregularized parameter estimates, which will be less biased (unbiased if you have the "correct" model) but have higher variance. They will "jump around", following your actual observations. Higher values of your regularization $\lambda$ will "constrain" your parameter estimates more and more. This is why the methods have names like "lasso" or "elastic net": they constrain the freedom of your parameters to float around and follow the data.
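In the special case of an orthonormal design the two effects can be written down in closed form: ridge rescales every OLS coefficient by $1/(1+\lambda)$, while the lasso applies soft thresholding and sets every coefficient with $|\hat{\beta}_{\text{OLS}}| \le \lambda$ exactly to zero (with the penalty scaled so that the threshold is $\lambda$). A minimal R sketch of these two formulas, not a general-purpose solver:
# OLS coefficients from an (assumed) orthonormal design
beta_ols <- c(3, 1.5, 0.4, -0.2)
lambda   <- 1
# Ridge: every coefficient is rescaled, never exactly zero for finite lambda
beta_ridge <- beta_ols / (1 + lambda)
# Lasso: soft thresholding; coefficients with |beta| <= lambda become exactly zero
beta_lasso <- sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
rbind(beta_ols, beta_ridge, beta_lasso)
This shows the behaviour the question asks about directly: the lasso sets small coefficients to exactly zero, while ridge only shrinks them.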
(I am writing up a little paper on this, which will hopefully be rather accessible. I'll add a link once it's available.)
Related Questions
- Solved – mathematical expression that shows how LASSO shrinks coefficients (including some to zero)
- Solved – Rationale behind shrinking regression coefficients in Ridge or LASSO regression
- Solved – How does Ridge Regression penalize for complexity if the coefficients are never allowed to go to zero
- Machine Learning – Comparing Lasso vs Ridge Regression for Bias-Variance Tradeoff
Best Answer
Suppose, as in the case of least-squares methods, you are trying to solve a statistical estimation problem for a (vector-valued) parameter $\beta$ by minimizing an objective function $Q(\beta)$ (such as the sum of squares of the residuals). Ridge regression "regularizes" the problem by adding to it a penalty $P(\beta)$ that is a non-negative linear combination of the squares of the components of the parameter (in the standard case $P(\beta)=\lambda||\beta||^2$). $P$ is (obviously) differentiable, with a unique global minimum at $\beta=0.$
The question asks: when is it possible for the global minimum of $Q+P$ to occur at $\beta=0$? Assume, as in least-squares methods, that $Q$ is differentiable in a neighborhood of $0.$ Because $0$ is a global minimum of $Q+P$, it is also a local minimum, implying all of the partial derivatives of $Q+P$ are $0$ there. The sum rule of differentiation implies
$$\frac{\partial}{\partial \beta_i}(Q(\beta) + P(\beta)) = \frac{\partial}{\partial \beta_i}Q(\beta) + \frac{\partial}{\partial \beta_i}P(\beta) = Q_i(\beta) + P_i(\beta)$$ is zero at $\beta=0.$ But since $P_i(0)=0$ for all $i,$ this implies $Q_i(0)=0$ for all $i,$ so $0$ is a critical point of the original objective function $Q.$ In any least-squares technique $Q$ is convex, so every critical point (and every local minimum) is also a global minimum. This compels us to conclude that ridge regression can shrink the whole coefficient vector to zero only when $\beta=0$ already minimizes the original, unpenalized objective $Q$: that is, only when the ordinary least-squares solution is itself zero (equivalently, when $\mathbf{X}^T\mathbf{y}=0$). For any other data, finite $\lambda$ shrinks the ridge estimates towards zero but never all the way to it.
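A quick numeric check of this conclusion, as a sketch (the design matrix and the $\lambda$ values are arbitrary, and the second response is constructed to be orthogonal to the columns of $\mathbf{X}$ so that $\mathbf{X}^T\mathbf{y}=0$):
set.seed(7)
X <- matrix(rnorm(20 * 3), 20, 3)
ridge <- function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
# Generic response: the coefficients shrink as lambda grows but never reach zero
y1 <- X %*% c(1, 2, 3) + rnorm(20)
sapply(c(0.1, 1, 100, 1e6), function(l) ridge(X, y1, l))
# Response orthogonal to the columns of X, so X'y = 0:
# both the OLS and the ridge solution are zero (up to floating-point noise) for every lambda
y2 <- residuals(lm(rnorm(20) ~ X - 1))
round(t(X) %*% y2, 10)
ridge(X, y2, 1)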