I understand that the lasso can drive some weights to zero and thereby prevent over-fitting. But in every figure I have seen of lasso coefficient paths, a weight stays at zero once it reaches zero; increasing $\lambda$ never drives it negative. Why is that?
Solved – Why weights are not negative in Lasso regression
lasso, regression, regularization, ridge regression
Related Solutions
This is regarding the variance
OLS provides what is called the Best Linear Unbiased Estimator (BLUE): any other unbiased estimator is bound to have higher variance than the OLS solution. So why on earth should we consider anything else?
Now the trick with regularization, such as the lasso or ridge, is to add some bias in order to reduce the variance. When you estimate your prediction error, it is a combination of three things: $$ \text{E}[(y-\hat{f}(x))^2]=\text{Bias}[\hat{f}(x)]^2 +\text{Var}[\hat{f}(x)]+\sigma^2 $$ The last term is the irreducible error, so we have no control over that. With the OLS solution the bias term is zero, but the variance term might be large. If we want good predictions, it might be a good idea to add some bias and hopefully reduce the variance.
So what is this $\text{Var}[\hat{f}(x)]$? It is the variance introduced by the estimates of the parameters in your model. The linear model has the form $$ \mathbf{y}=\mathbf{X}\beta + \epsilon,\qquad \epsilon\sim\mathcal{N}(0,\sigma^2I) $$ To obtain the OLS solution we solve the minimization problem $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2 $$ This provides the solution $$ \hat{\beta}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$ The minimization problem for ridge regression is similar: $$ \arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2+\lambda||\beta||^2\qquad \lambda>0 $$ Now the solution becomes $$ \hat{\beta}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}\mathbf{X}^T\mathbf{y} $$ So we are adding this $\lambda I$ (called the ridge) on the diagonal of the matrix that we invert. The effect this has on the matrix $\mathbf{X}^T\mathbf{X}$ is that it "pulls" its determinant away from zero. Thus when you invert it, the eigenvalues of the inverse do not blow up, and as a consequence the variance of the parameter estimates becomes lower.
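As a sketch, the two closed-form solutions above can be computed directly in R on simulated data (the design matrix, true coefficients, and $\lambda$ below are arbitrary choices for illustration):

```r
set.seed(42)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)        # random design matrix
beta_true <- c(2, -1, 0)
y <- X %*% beta_true + rnorm(n)        # y = X beta + noise

lambda <- 5
XtX <- t(X) %*% X
beta_ols   <- solve(XtX) %*% t(X) %*% y                     # (X'X)^{-1} X'y
beta_ridge <- solve(XtX + lambda * diag(p)) %*% t(X) %*% y  # (X'X + lambda I)^{-1} X'y

# Ridge shrinks the coefficient vector toward zero
sum(beta_ridge^2) < sum(beta_ols^2)
```

For any $\lambda > 0$ the ridge estimate has a strictly smaller squared norm than the (nonzero) OLS estimate; that shrinkage is the bias we trade for lower variance.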
I am not sure I can provide a clearer answer than this. What it all boils down to is the covariance matrix of the parameter estimates and the magnitude of its entries.
I took ridge regression as an example because it is much easier to treat. The lasso is much harder, and it is still a topic of active research.
These slides provide some more information and this blog also has some relevant information.
EDIT: What do I mean by saying that adding the ridge "pulls" the determinant away from zero?
Note that the matrix $\mathbf{X}^T\mathbf{X}$ is a symmetric positive definite matrix (assuming $\mathbf{X}$ has full column rank). All symmetric matrices with real values have real eigenvalues, and since it is positive definite, its eigenvalues are all greater than zero.
Ok so how do we calculate the eigenvalues? We solve the characteristic equation: $$ \text{det}(\mathbf{X}^T\mathbf{X}-tI)=0 $$ This is a polynomial in $t$, and as stated above, its roots are real and positive. Now let's take a look at the equation for the ridge matrix we need to invert: $$ \text{det}(\mathbf{X}^T\mathbf{X}+\lambda I-tI)=0 $$ We can rearrange this a little bit and see: $$ \text{det}(\mathbf{X}^T\mathbf{X}-(t-\lambda)I)=0 $$ Solving this for $(t-\lambda)$ gives the same roots as the first problem. So if $t_i$ is an eigenvalue of $\mathbf{X}^T\mathbf{X}$, the corresponding eigenvalue of the ridge matrix is $t_i+\lambda$: every eigenvalue is shifted up by $\lambda$, so they all move away from zero.
Here is some R code to illustrate this:
# Create a random 3x3 matrix
A <- matrix(sample(10, 9, replace = TRUE), nrow = 3, ncol = 3)
# Symmetrize it (B is symmetric, though not necessarily positive definite)
B <- A + t(A)
# Eigenvalues of B
eigen(B)
# Eigenvalues of B with a ridge of 3 added to the diagonal
eigen(B + 3 * diag(3))
Which gives the results:
> eigen(B)
$values
[1] 37.368634 6.952718 -8.321352
> eigen(B+3*diag(3))
$values
[1] 40.368634 9.952718 -5.321352
So all the eigenvalues get shifted up by exactly 3.
You can also prove this in general using the Gershgorin circle theorem: the centers of the circles containing the eigenvalues are the diagonal elements, so by adding enough to the diagonal you can move all the circles into the positive real half-plane. That result is more general than we need here.
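A quick numerical check of the Gershgorin bound (the symmetric matrix below is an arbitrary example of my own):

```r
B <- matrix(c(10, 2, 1,
               2, 8, 3,
               1, 3, 6), nrow = 3, byrow = TRUE)   # symmetric example
centers <- diag(B)                                 # circle centers: diagonal entries
radii   <- rowSums(abs(B)) - abs(centers)          # radii: off-diagonal row sums
ev <- eigen(B)$values
# The theorem guarantees every eigenvalue lies in at least one disc
all(sapply(ev, function(t) any(abs(t - centers) <= radii)))
```

Here the smallest center minus its radius is still positive, so every disc, and hence every eigenvalue, sits in the positive half-plane.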
Firstly, it is worth noting that this description of what ridge does assumes that the design matrix is orthonormal.
Secondly, the answer to your question is yes under those circumstances. The details may be found in *The Elements of Statistical Learning*, Section 3.4.3 (p. 69). The short story is that the lasso estimate is the soft-thresholded OLS estimate:
$ \beta \to \operatorname{sign}(\beta)\max(|\beta|-\lambda,0)$. Please see the book for the complete discussion and details.
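That soft-thresholding rule is easy to verify numerically; a minimal sketch in R (the function name is mine):

```r
# Soft-thresholding: shrink the magnitude by lambda, clipping at zero
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

soft_threshold(c(-3, -0.5, 0.5, 3), lambda = 1)
# -2  0  0  2: small coefficients are set to exactly zero,
# large ones are shrunk by lambda but keep their sign
```

Because the magnitude is clipped at zero, increasing $\lambda$ can never push a coefficient through zero to the opposite sign, which is exactly what the question's figures show.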
Best Answer
Let's say that $\beta$ is the vector of coefficients to be estimated by Lasso regression.
Lasso regression applies an $L1$ penalty to the coefficients. The penalty term in the objective function is $\lambda\|\beta\|_1 = \lambda \Sigma |\beta_i|$, i.e., the absolute values of the coefficients are penalized. Therefore, a negative coefficient incurs the same penalty as a positive coefficient of the same magnitude, and it is the absolute values of the coefficients that are driven down toward zero.
Note that an alternative form of Lasso regression applies the penalty as a constraint $\Sigma |\beta_i| \le t$, for some $t > 0$, rather than as an additive term in the objective function. As with the penalty form, it is the absolute values of the estimated coefficients that are constrained (in effect, penalized). For any value of $\lambda$ in the objective function penalty term, there is a corresponding value of $t$ in the constraint form such that the optimal $\beta$ is the same.
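To see both points concretely, assume an orthonormal design, in which case the lasso solution is just the soft-threshold of the OLS coefficients (a sketch under that assumption, not the general algorithm; the values below are arbitrary):

```r
# Soft-thresholding operator (lasso solution under an orthonormal design)
st <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

beta_ols <- c(3, -2, 0.5)
lambdas <- seq(0, 4, by = 0.5)
path <- sapply(lambdas, function(l) st(beta_ols, l))
print(path)  # each row traces one coefficient as lambda grows

# Once a coefficient hits zero it stays at zero; it never crosses sign.
# The constraint budget t corresponding to a given lambda is simply the
# L1 norm of the penalized solution:
t_equiv <- sum(abs(st(beta_ols, 1)))   # here t = |2| + |-1| + |0| = 3
```

Solving the constrained form with budget `t_equiv` would recover the same coefficients as the penalized form with that $\lambda$, which is the correspondence described above.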