I understand that with ridge or lasso regression we are trying to shrink the regression coefficients, and that we specify the amount of shrinkage by varying alpha. But I cannot understand the intuition or rationale behind doing this, since we no longer fit the best line.
Solved – Rationale behind shrinking regression coefficients in Ridge or LASSO regression
lasso, regression, ridge regression
Related Solutions
This is regarding the variance
OLS provides what is called the Best Linear Unbiased Estimator (BLUE). That means that if you take any other unbiased estimator, it is bound to have a higher variance than the OLS solution. So why on earth should we consider anything other than that?
Now the trick with regularization, such as the lasso or ridge, is to add some bias in order to try to reduce the variance. When you estimate your prediction error, it is a combination of three things: $$ \text{E}[(y-\hat{f}(x))^2]=\text{Bias}[\hat{f}(x)]^2 +\text{Var}[\hat{f}(x)]+\sigma^2 $$ The last part is the irreducible error, so we have no control over that. With the OLS solution the bias term is zero, but the variance term might be large. So it might be a good idea (if we want good predictions) to add in some bias and hopefully reduce the variance.
So what is this $\text{Var}[\hat{f}(x)]$? It is the variance introduced in the estimates for the parameters in your model. The linear model has the form $$ \mathbf{y}=\mathbf{X}\beta + \epsilon,\qquad \epsilon\sim\mathcal{N}(0,\sigma^2I) $$ To obtain the OLS solution we solve the minimization problem $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2 $$ This provides the solution $$ \hat{\beta}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$ The minimization problem for ridge regression is similar: $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2+\lambda||\beta||^2\qquad \lambda>0 $$ Now the solution becomes $$ \hat{\beta}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}\mathbf{X}^T\mathbf{y} $$ So we are adding this $\lambda I$ (called the ridge) on the diagonal of the matrix that we invert. The effect this has on the matrix $\mathbf{X}^T\mathbf{X}$ is that it "pulls" the determinant of the matrix away from zero. Thus when you invert it, you do not get huge eigenvalues. But that leads to another interesting fact, namely that the variance of the parameter estimates becomes lower.
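As a rough illustration (this R sketch is not part of the original answer, and the simulation settings are made up), you can compare the spread of the closed-form OLS and ridge estimates over repeated samples when two predictors are nearly collinear; the ridge estimates are biased but vary far less:

# Compare OLS and ridge estimates over 200 simulated data sets
set.seed(1)
lambda <- 5
betas_ols   <- matrix(NA, 200, 2)
betas_ridge <- matrix(NA, 200, 2)
for (i in 1:200) {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.1)   # nearly collinear with x1
  X  <- cbind(x1, x2)
  y  <- x1 + x2 + rnorm(50)        # true coefficients are (1, 1)
  betas_ols[i, ]   <- solve(crossprod(X), crossprod(X, y))
  betas_ridge[i, ] <- solve(crossprod(X) + lambda * diag(2), crossprod(X, y))
}
apply(betas_ols, 2, sd)    # large spread: the OLS estimates vary a lot
apply(betas_ridge, 2, sd)  # much smaller spread, at the price of some bias
colMeans(betas_ridge)      # the bias: slightly shrunk away from the true (1, 1)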
I am not sure if I can provide a clearer answer than this. What this all boils down to is the covariance matrix for the parameters in the model and the magnitude of the values in that covariance matrix.
I took ridge regression as an example because it is much easier to treat. The lasso is much harder, and research on that topic is still ongoing.
These slides provide some more information and this blog also has some relevant information.
EDIT: What do I mean by saying that adding the ridge "pulls" the determinant away from zero?
Note that the matrix $\mathbf{X}^T\mathbf{X}$ is a symmetric positive definite matrix (assuming $\mathbf{X}$ has full column rank, which is needed for the OLS solution above to exist). All symmetric matrices with real entries have real eigenvalues, and since the matrix is positive definite, its eigenvalues are all greater than zero.
Ok so how do we calculate the eigenvalues? We solve the characteristic equation: $$ \text{det}(\mathbf{X}^T\mathbf{X}-tI)=0 $$ This is a polynomial in $t$, and as stated above, the eigenvalues are real and positive. Now let's take a look at the equation for the ridge matrix we need to invert: $$ \text{det}(\mathbf{X}^T\mathbf{X}+\lambda I-tI)=0 $$ We can change this a little bit and see: $$ \text{det}(\mathbf{X}^T\mathbf{X}-(t-\lambda)I)=0 $$ So we can solve this for $(t-\lambda)$ and get the same eigenvalues as for the first problem. Let's assume that one eigenvalue is $t_i$. So the eigenvalue for the ridge problem becomes $t_i+\lambda$. It gets shifted by $\lambda$. This happens to all the eigenvalues, so they all move away from zero.
Here is some R code to illustrate this:
# Create a random 3 x 3 matrix of integers
A <- matrix(sample(10, 9, replace = TRUE), nrow = 3, ncol = 3)
# Make it symmetric
B <- A + t(A)
# Calculate its eigenvalues
eigen(B)
# Calculate the eigenvalues of B with a ridge of 3 added to the diagonal
eigen(B + 3 * diag(3))
Which gives the results:
> eigen(B)
$values
[1] 37.368634 6.952718 -8.321352
> eigen(B+3*diag(3))
$values
[1] 40.368634 9.952718 -5.321352
So all the eigenvalues get shifted up by exactly 3.
You can also prove this in general by using the Gershgorin circle theorem, where the centers of the circles containing the eigenvalues are the diagonal elements. You can always add "enough" to the diagonal elements to move all the circles into the positive real half-plane. That result is more general, but it is not needed here.
Does this indicate improvement of my predictor reduction from ridge to lasso?
No, the plots don't say anything about predictive performance. If you want to estimate that, you can use cross validation.
I.e., does the 6-predictor model do a better job of fitting the data than the 8-predictor model?
Compared to ordinary least squares (OLS), regularized methods like lasso and ridge regression will give greater or equal error on the training data. But, if you're interested in predictive performance, what you really care about is error on future data generated by the same underlying distribution. This is what cross validation estimates. The method (and value of $\lambda$) that will perform best depends on the problem.
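One way to estimate this in R is with cross validation from the glmnet package (a hedged sketch, not from the original answers; the simulated data below merely stands in for your predictors and response):

library(glmnet)
set.seed(42)
# Simulated data standing in for the real predictors and response
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

cv_lasso <- cv.glmnet(x, y, alpha = 1)  # alpha = 1: lasso penalty
cv_ridge <- cv.glmnet(x, y, alpha = 0)  # alpha = 0: ridge penalty

min(cv_lasso$cvm)      # cross-validated MSE at the best lasso lambda
min(cv_ridge$cvm)      # cross-validated MSE at the best ridge lambda
cv_lasso$lambda.min    # lambda value with the lowest estimated error

Whichever method achieves the lower cross-validated error is the better bet for prediction on that particular problem.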
If you're interested in statistical inference (i.e. accounting for uncertainty in parameter estimates, or properly identifying an underlying 'true' model), then you'd need a way to compute p values, confidence intervals, etc. The standard procedures designed for OLS won't work for lasso and ridge regression. Also, keep in mind that there are many subtleties and caveats in identifying 'important variables'.
- Are the ridge regression estimates at the smallest $\lambda$ value exactly the same as the least squares estimates?
When $\lambda=0$ both ridge regression and lasso are equivalent to ordinary least squares (OLS). You can see this by writing the optimization problem for each method and setting $\lambda$ to zero:
$$\beta_{OLS} = \underset{\beta}{\text{argmin}} \sum_{i=1}^n (y_i - \beta \cdot x_i)^2$$
$$\beta_{lasso} = \underset{\beta}{\text{argmin}} \sum_{i=1}^n (y_i - \beta \cdot x_i)^2 + \lambda \|\beta\|_1$$
$$\beta_{ridge} = \underset{\beta}{\text{argmin}} \sum_{i=1}^n (y_i - \beta \cdot x_i)^2 + \lambda \|\beta\|_2^2$$
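If you want to verify this numerically, here is a hedged R sketch (not from the original answer) using the glmnet package; with $\lambda = 0$ the lasso and ridge fits should reproduce the OLS coefficients up to numerical tolerance:

library(glmnet)
set.seed(1)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(3, -2, 0, 0, 1)) + rnorm(n)

coef(lm(y ~ x))                                            # OLS
coef(glmnet(x, y, alpha = 1, lambda = 0, thresh = 1e-12))  # lasso with lambda = 0
coef(glmnet(x, y, alpha = 0, lambda = 0, thresh = 1e-12))  # ridge with lambda = 0
# All three sets of coefficients should agree up to glmnet's convergence tolerance.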
- How to interpret these two plots?
Each trajectory shows the value of an individual coefficient as $\lambda$ is changed. It looks like your x axis is mislabeled ($\lambda$ is actually decreasing from left to right).
Some general things you can notice in these plots (which are well-known facts about lasso and ridge regression):
- Both methods shrink the coefficients more strongly toward zero as $\lambda$ increases (moving from right to left on the x axis).
- Lasso produces sparse solutions: as $\lambda$ increases, more and more coefficients are driven exactly to zero while others remain relatively large (which is why lasso is useful for variable selection).
- Ridge regression doesn't behave this way: as $\lambda$ increases, the overall magnitude of the coefficients decreases, but individual coefficients are not driven exactly to zero.
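Plots like these are straightforward to reproduce in R, for example with the glmnet package (an illustrative sketch on simulated data, not your actual data):

library(glmnet)
set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

fit_lasso <- glmnet(x, y, alpha = 1)
fit_ridge <- glmnet(x, y, alpha = 0)

par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "lambda")  # lasso paths: some coefficients hit exactly zero
plot(fit_ridge, xvar = "lambda")  # ridge paths: coefficients shrink but stay nonzero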
What does it mean when the end points in red are on the line, or above or below it?
You said the red points represent the OLS coefficients. Because lasso and ridge regression shrink the coefficients toward zero, the magnitudes will be smaller than OLS when $\lambda > 0$. Your plots would intersect the red points at $\lambda=0$, where all methods are equivalent.
Related Question
- Solved – Any special case where ridge regression can shrink coefficients to zero
- Solved – How does Ridge Regression penalize for complexity if the coefficients are never allowed to go to zero
- R – Get Odds Ratios with Confidence Intervals from a Lasso Regression Model
- Feature Selection – Is Ridge More Robust than Lasso?
Best Answer
Here is the general intuition behind shrinking coefficients in linear regression, borrowing figures and equations from Pattern Recognition and Machine Learning by Bishop.
Imagine that you have to approximate the function $y = \sin(2\pi x)$ from $N$ observations. You can do this using linear regression, which fits the $M$-degree polynomial,
$$ y(x, \textbf{w}) = \sum_{j=0}^{M}{w_j x^j} $$ by minimizing the error function,
$$ E( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} $$
By choosing different values of $M$, one can fit different degree polynomials of varying complexity. Here are some example fits (red lines) and the corresponding values of $M$. Blue dots represent the observations and the green line is the true underlying function. The goal is to fit a polynomial which closely approximates the underlying function (green line).
Watch what happens with the high-degree polynomial, $M=9$. This polynomial gives the minimum error, since it passes through all the points. But this is not a good fit, because the model is fitting the noise in the data rather than the underlying function. Since the overall goal of linear regression is to predict $t$ for an unseen value of $x$, you will be screwed with the high-degree polynomial model!
Further, let's take a look at the values of the regression coefficients. Watch how the values of the coefficients exploded for the higher degree polynomial.
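If you want to see this numerically, here is a small R sketch (not Bishop's code; the sample size and noise level are made up) that fits degree-3 and degree-9 polynomials to noisy samples of $\sin(2\pi x)$ and prints their coefficients:

# Simulate noisy observations of the true function sin(2*pi*x)
set.seed(0)
N <- 15
x <- seq(0, 1, length.out = N)
t_obs <- sin(2 * pi * x) + rnorm(N, sd = 0.3)

# Ordinary least-squares fits of a degree-3 and a degree-9 polynomial
fit_3 <- lm(t_obs ~ poly(x, 3, raw = TRUE))
fit_9 <- lm(t_obs ~ poly(x, 9, raw = TRUE))

round(coef(fit_3), 1)  # moderate coefficient values
round(coef(fit_9), 1)  # typically far larger values: the fit is chasing the noise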
The solution to this problem is regularization! The error function to be minimized is redefined as follows:
$$ E'( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} + \frac{\lambda}{2} \Vert \textbf{w} \Vert^2 $$
This gives us the formulation of Ridge regression. The inclusion of the penalty term, the $L_2$ norm of the weights, discourages the regression coefficients from reaching large values, thus preventing over-fitting.
Lasso has a similar formulation, where the penalty term is the $L_1$ norm. Inclusion of the lasso penalty term has an interesting effect -- it drives some of the coefficients exactly to zero, giving a sparse solution.
$$ E''( \textbf{w}) = \frac{1}{2} \sum_{n=1}^{N}{\{ y(x_n, \textbf{w}) - t_n \}^2} + \frac{\lambda}{2} \Vert \textbf{w} \Vert_1 $$
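Continuing the toy example from above (again just a sketch, not Bishop's code; $\lambda$ is an arbitrary small value), solving the ridge-penalized problem in closed form, $\mathbf{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\mathbf{t}$, keeps the degree-9 coefficients small:

# Same toy data as in the earlier sketch
set.seed(0)
N <- 15
x <- seq(0, 1, length.out = N)
t_obs <- sin(2 * pi * x) + rnorm(N, sd = 0.3)

# Degree-9 polynomial design matrix, including the intercept column x^0
Phi <- outer(x, 0:9, "^")

lambda <- 1e-3
w_ols   <- coef(lm(t_obs ~ Phi - 1))                                         # unpenalized fit
w_ridge <- solve(crossprod(Phi) + lambda * diag(10), crossprod(Phi, t_obs))  # ridge fit

round(cbind(ols = unname(w_ols), ridge = drop(w_ridge)), 1)
# The unpenalized coefficients blow up; the ridge-penalized ones stay modest.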