This is regarding the variance.
OLS provides what is called the Best Linear Unbiased Estimator (BLUE). That means that if you take any other unbiased linear estimator, it is bound to have a higher variance than the OLS solution. So why on earth should we consider anything other than that?
Now the trick with regularization, such as the lasso or ridge, is to add some bias in turn to try to reduce the variance. Because when you estimate your prediction error, it is a combination of three things:
$$
\text{E}[(y-\hat{f}(x))^2]=\text{Bias}[\hat{f}(x)]^2
+\text{Var}[\hat{f}(x)]+\sigma^2
$$
The last part is the irreducible error, so we have no control over that. With the OLS solution the bias term is zero, but the second term might be large. It might therefore be a good idea (if we want good predictions) to add in some bias and hopefully reduce the variance.
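To make this concrete, here is a small R sketch of the decomposition (the numbers and setup are purely illustrative, and the penalized fit used here is the ridge estimator defined properly further down). With two nearly collinear predictors, the penalized fit picks up a little bias but sheds a lot of variance, so its total prediction error at a test point comes out lower than that of OLS:
set.seed(1)
n <- 50; sigma <- 1
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)   # fixed, nearly collinear design
X  <- cbind(x1, x2)
beta <- c(1, 1)
x0 <- c(1, 2)                                   # test point
f0 <- sum(x0 * beta)                            # true mean at the test point
fit <- function(y, lambda) solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
sim <- function(lambda) {
  # prediction at x0 over many training samples drawn from the true model
  fhat <- replicate(5000, sum(x0 * fit(X %*% beta + rnorm(n, sd = sigma), lambda)))
  c(bias2 = (mean(fhat) - f0)^2,
    var   = var(fhat),
    total = (mean(fhat) - f0)^2 + var(fhat) + sigma^2)   # bias^2 + variance + irreducible
}
rbind(OLS = sim(0), ridge = sim(5))   # ridge: a little bias, much smaller variance, lower total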
So what is this $\text{Var}[\hat{f}(x)]$? It is the variance introduced in the estimates for the parameters in your model. The linear model has the form
$$
\mathbf{y}=\mathbf{X}\beta + \epsilon,\qquad \epsilon\sim\mathcal{N}(0,\sigma^2I)
$$
To obtain the OLS solution we solve the minimization problem
$$
\arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2
$$
This provides the solution
$$
\hat{\beta}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
$$
The minimization problem for ridge regression is similar:
$$
\arg \min_\beta ||\mathbf{y}-\mathbf{X}\beta||^2+\lambda||\beta||^2\qquad \lambda>0
$$
Now the solution becomes
$$
\hat{\beta}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}\mathbf{X}^T\mathbf{y}
$$
So we are adding this $\lambda I$ (called the ridge) to the diagonal of the matrix that we invert. The effect this has on the matrix $\mathbf{X}^T\mathbf{X}$ is that it "pulls" its determinant (and its eigenvalues) away from zero. Thus when you invert it, the inverse does not have huge eigenvalues. That leads to another interesting fact, namely that the variance of the parameter estimates becomes lower.
I am not sure if I can provide a clearer answer than this. What this all boils down to is the covariance matrix of the parameters in the model and the magnitude of the values in that covariance matrix.
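As a small sketch of that point, the following R code compares the standard closed-form covariance matrices, $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ for OLS and $\sigma^2(\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda I)^{-1}$ for ridge, on a nearly collinear design (the specific numbers are only illustrative):
set.seed(2)
n <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear columns
X  <- cbind(x1, x2)
sigma2 <- 1; lambda <- 1
XtX <- t(X) %*% X
V_ols   <- sigma2 * solve(XtX)                  # covariance of the OLS estimates
S       <- solve(XtX + lambda * diag(2))
V_ridge <- sigma2 * S %*% XtX %*% S             # covariance of the ridge estimates
diag(V_ols)     # large variances, because X'X is nearly singular
diag(V_ridge)   # much smaller variances, bought with a bit of bias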
I took ridge regression as an example, because it is much easier to treat. The lasso is much harder, and there is still active research on that topic.
These slides provide some more information and this blog also has some relevant information.
EDIT: What do I mean when I say that adding the ridge "pulls" the determinant away from zero?
Note that the matrix $\mathbf{X}^T\mathbf{X}$ is a symmetric positive semi-definite matrix (and positive definite when $\mathbf{X}$ has full column rank). All symmetric matrices with real entries have real eigenvalues, and since the matrix is positive semi-definite, its eigenvalues are all greater than or equal to zero.
Ok so how do we calculate the eigenvalues? We solve the characteristic equation:
$$
\text{det}(\mathbf{X}^T\mathbf{X}-tI)=0
$$
This is a polynomial in $t$, and as stated above, the eigenvalues are real and non-negative. Now let's take a look at the equation for the ridge matrix we need to invert:
$$
\text{det}(\mathbf{X}^T\mathbf{X}+\lambda I-tI)=0
$$
We can change this a little bit and see:
$$
\text{det}(\mathbf{X}^T\mathbf{X}-(t-\lambda)I)=0
$$
So we can solve this for $(t-\lambda)$ and get the same eigenvalues as for the first problem. Let's assume that one eigenvalue is $t_i$. So the eigenvalue for the ridge problem becomes $t_i+\lambda$. It gets shifted by $\lambda$. This happens to all the eigenvalues, so they all move away from zero.
Here is some R code to illustrate this:
# Create random matrix
A <- matrix(sample(10, 9, replace = TRUE), nrow = 3, ncol = 3)
# Make a symmetric matrix
B <- A+t(A)
# Calculate eigenvalues
eigen(B)
# Calculate eigenvalues of B with ridge
eigen(B+3*diag(3))
Which gives the results:
> eigen(B)
$values
[1] 37.368634 6.952718 -8.321352
> eigen(B+3*diag(3))
$values
[1] 40.368634 9.952718 -5.321352
So all the eigenvalues get shifted up by exactly 3.
You can also prove this in general by using the Gershgorin circle theorem. There, the centers of the circles containing the eigenvalues are the diagonal elements. You can always add "enough" to the diagonal elements to move all the circles into the positive real half-plane. That result is more general, but it is not needed here.
How bridge regression and elastic net differ is a fascinating question, given their similar-looking penalties. Here's one possible approach. Suppose we solve the bridge regression problem. We can then ask how the elastic net solution would differ. Looking at the gradients of the two loss functions can tell us something about this.
Bridge regression
Say $X$ is a matrix containing values of the independent variable ($n$ points $\times$ $d$ dimensions), $y$ is a vector containing values of the dependent variable, and $w$ is the weight vector.
The loss function penalizes the $\ell_q$ norm of the weights, with magnitude $\lambda_b$:
$$
L_b(w)
= \| y - Xw\|_2^2
+ \lambda_b \|w\|_q^q
$$
The gradient of the loss function is:
$$
\nabla_w L_b(w)
= -2 X^T (y - Xw)
+ \lambda_b q |w|^{\circ(q-1)} \text{sgn}(w)
$$
$v^{\circ c}$ denotes the Hadamard (i.e. element-wise) power, which gives a vector whose $i$th element is $v_i^c$. $\text{sgn}(w)$ is the sign function (applied to each element of $w$). The gradient may be undefined at zero for some values of $q$.
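For reference, a minimal R version of this gradient could look as follows (the function name is mine, not from any package):
bridge_grad <- function(w, X, y, lambda_b, q) {
  # -2 X'(y - Xw) + lambda_b * q * |w|^(q-1) * sign(w), applied element-wise;
  # the penalty part can be undefined at w_j = 0 for some values of q
  drop(-2 * t(X) %*% (y - X %*% w)) + lambda_b * q * abs(w)^(q - 1) * sign(w)
}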
Elastic net
The loss function is:
$$
L_e(w)
= \|y - Xw\|_2^2
+ \lambda_1 \|w\|_1
+ \lambda_2 \|w\|_2^2
$$
This penalizes the $\ell_1$ norm of the weights with magnitude $\lambda_1$ and the $\ell_2$ norm with magnitude $\lambda_2$. The elastic net paper calls minimizing this loss function the 'naive elastic net' because it doubly shrinks the weights. They describe an improved procedure where the weights are later rescaled to compensate for the double shrinkage, but I'm just going to analyze the naive version. That's a caveat to keep in mind.
The gradient of the loss function is:
$$
\nabla_w L_e(w)
= -2 X^T (y - Xw)
+ \lambda_1 \text{sgn}(w)
+ 2 \lambda_2 w
$$
The gradient is undefined at zero when $\lambda_1 > 0$ because the absolute value in the $\ell_1$ penalty isn't differentiable there.
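A matching R sketch of this gradient (again just an illustrative helper; it is not defined at $w_j = 0$ when $\lambda_1 > 0$):
elastic_net_grad <- function(w, X, y, lambda1, lambda2) {
  # -2 X'(y - Xw) + lambda1 * sign(w) + 2 * lambda2 * w
  drop(-2 * t(X) %*% (y - X %*% w)) + lambda1 * sign(w) + 2 * lambda2 * w
}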
Approach
Say we select weights $w^*$ that solve the bridge regression problem. This means that the bridge regression gradient is zero at this point:
$$
\nabla_w L_b(w^*)
= -2 X^T (y - Xw^*)
+ \lambda_b q |w^*|^{\circ (q-1)} \text{sgn}(w^*)
= \vec{0}
$$
Therefore:
$$
2 X^T (y - Xw^*)
= \lambda_b q |w^*|^{\circ (q-1)} \text{sgn}(w^*)
$$
We can substitute this into the elastic net gradient, to get an expression for the elastic net gradient at $w^*$. Fortunately, it no longer depends directly on the data:
$$
\nabla_w L_e(w^*)
= \lambda_1 \text{sgn}(w^*)
+ 2 \lambda_2 w^*
-\lambda_b q |w^*|^{\circ (q-1)} \text{sgn}(w^*)
$$
Looking at the elastic net gradient at $w^*$ tells us: Given that bridge regression has converged to weights $w^*$, how would the elastic net want to change these weights?
It gives us the local direction and magnitude of the desired change, because the gradient points in the direction of steepest ascent and the loss function will decrease as we move in the direction opposite to the gradient. The gradient might not point directly toward the elastic net solution. But, because the elastic net loss function is convex, the local direction/magnitude gives some information about how the elastic net solution will differ from the bridge regression solution.
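Because the expression above involves only $w^*$ and the penalty parameters, it is easy to evaluate directly. Here is a small R helper (the name and example values are mine):
enet_grad_at_bridge <- function(w_star, lambda1, lambda2, lambda_b, q) {
  lambda1 * sign(w_star) + 2 * lambda2 * w_star -
    lambda_b * q * abs(w_star)^(q - 1) * sign(w_star)
}
# Example: with lambda_b = 0, lambda1 = 0, lambda2 = 1 (Case 1 below) this reduces to
# 2 * w_star, i.e. a pull on each weight straight back toward zero:
enet_grad_at_bridge(c(-1, 0.5, 2), lambda1 = 0, lambda2 = 1, lambda_b = 0, q = 1.4)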
Case 1: Sanity check
($\lambda_b = 0, \lambda_1 = 0, \lambda_2 = 1$). Bridge regression in this case is equivalent to ordinary least squares (OLS), because the penalty magnitude is zero. The elastic net is equivalent to ridge regression, because only the $\ell_2$ norm is penalized. The following plots show different bridge regression solutions and how the elastic net gradient behaves for each.
Left plot: Elastic net gradient vs. bridge regression weight along each dimension
The x axis represents one component of a set of weights $w^*$ selected by bridge regression. The y axis represents the corresponding component of the elastic net gradient, evaluated at $w^*$. Note that the weights are multidimensional, but we're just looking at the weights/gradient along a single dimension.
Right plot: Elastic net changes to bridge regression weights (2d)
Each point represents a set of 2d weights $w^*$ selected by bridge regression. For each choice of $w^*$, a vector is plotted pointing in the direction opposite the elastic net gradient, with magnitude proportional to that of the gradient. That is, the plotted vectors show how the elastic net wants to change the bridge regression solution.
These plots show that, compared to bridge regression (OLS in this case), elastic net (ridge regression in this case) wants to shrink weights toward zero. The desired amount of shrinkage increases with the magnitude of the weights. If the weights are zero, the solutions are the same. The interpretation is that we want to move in the direction opposite to the gradient to reduce the loss function. For example, say bridge regression converged to a positive value for one of the weights. The elastic net gradient is positive at this point, so elastic net wants to decrease this weight. If using gradient descent, we'd take steps proportional in size to the gradient (of course, we can't technically use gradient descent to solve the elastic net because of the non-differentiability at zero, but subgradient descent would give numerically similar results).
Case 2: Matching bridge & elastic net
($q = 1.4, \lambda_b = 1, \lambda_1 = 0.629, \lambda_2 = 0.355$). I chose the bridge penalty parameters to match the example from the question. I chose the elastic net parameters to give the best matching elastic net penalty. Here, best-matching means, given a particular distribution of weights, we find the elastic net penalty parameters that minimize the expected squared difference between the bridge and elastic net penalties:
$$
\min_{\lambda_1, \lambda_2} \enspace
E \left [ (
\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2
- \lambda_b \|w\|_q^q
)^2 \right ]
$$
Here, I considered weights with all entries drawn i.i.d. from the uniform distribution on $[-2, 2]$ (i.e. within a hypercube centered at the origin). The best-matching elastic net parameters were similar for 2 to 1000 dimensions. Although they don't appear to be sensitive to the dimensionality, the best-matching parameters do depend on the scale of the distribution.
Penalty surface
Here's a contour plot of the total penalty imposed by bridge regression ($q=1.4, \lambda_b=1$) and best-matching elastic net ($\lambda_1 = 0.629, \lambda_2 = 0.355$) as a function of the weights (for the 2d case):
Gradient behavior
We can see the following:
- Let $w^*_j$ be the chosen bridge regression weight along dimension $j$.
- If $|w^*_j|< 0.25$, elastic net wants to shrink the weight toward zero.
- If $|w^*_j| \approx 0.25$, the bridge regression and elastic net solutions are the same. But, elastic net wants to move away if the weight differs even slightly.
- If $0.25 < |w^*_j| < 1.31$, elastic net wants to grow the weight.
- If $|w^*_j| \approx 1.31$, the bridge regression and elastic net solutions are the same. Elastic net wants to move toward this point from nearby weights.
- If $|w^*_j| > 1.31$, elastic net wants to shrink the weight.
The results are qualitatively similar if we change the value of $q$ and/or $\lambda_b$ and find the corresponding best $\lambda_1, \lambda_2$. The points where the bridge and elastic net solutions coincide change slightly, but the behavior of the gradients is otherwise similar.
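The crossing points quoted above can also be recovered numerically. For a positive weight $w^*_j$, the per-dimension elastic net gradient at the bridge solution is $\lambda_1 + 2\lambda_2 w^*_j - \lambda_b q (w^*_j)^{q-1}$; plugging in the Case 2 parameters and root-finding (a quick sketch, with bracketing intervals chosen by eye) gives roots near 0.25 and 1.31:
g <- function(w) 0.629 + 2 * 0.355 * w - 1 * 1.4 * w^(1.4 - 1)   # Case 2 parameters, w > 0
uniroot(g, c(0.2, 0.5))$root   # approximately 0.25
uniroot(g, c(1.0, 2.0))$root   # approximately 1.31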
Case 3: Mismatched bridge & elastic net
$(q=1.8, \lambda_b=1, \lambda_1=0.765, \lambda_2 = 0.225)$. In this regime, bridge regression behaves similarly to ridge regression. I found the best-matching $\lambda_1, \lambda_2$, but then swapped them so that the elastic net behaves more like the lasso ($\ell_1$ penalty greater than $\ell_2$ penalty).
Relative to bridge regression, elastic net wants to shrink small weights toward zero and increase larger weights. There's a single set of weights in each quadrant where the bridge regression and elastic net solutions coincide, but elastic net wants to move away from this point if the weights differ even slightly.
$(q=1.2, \lambda_b=1, \lambda_1=0.173, \lambda_2 = 0.816)$. In this regime, the bridge penalty is more similar to an $\ell_1$ penalty (although bridge regression may not produce sparse solutions with $q > 1$, as mentioned in the elastic net paper). I found the best-matching $\lambda_1, \lambda_2$, but then swapped them so that the elastic net behaves more like ridge regression ($\ell_2$ penalty greater than $\ell_1$ penalty).
Relative to bridge regression, elastic net wants to grow small weights and shrink larger weights. There's a point in each quadrant where the bridge regression and elastic net solutions coincide, and elastic net wants to move toward these weights from neighboring points.
Best Answer
Let's consider a very simple model: $y = \beta x + e$, with an L1 penalty on $\hat{\beta}$ and a least-squares loss function on $\hat{e}$. We can expand the expression to be minimized as:
$\min y^Ty -2 y^Tx\hat{\beta} + \hat{\beta} x^Tx\hat{\beta} + 2\lambda|\hat{\beta}|$
Keep in mind this is a univariate example, with $\beta$ and $x$ being scalars, to show how LASSO can send a coefficient to zero. This can be generalized to the multivariate case.
Let us assume the least-squares solution is some $\hat{\beta} > 0$, which is equivalent to assuming that $y^Tx > 0$, and see what happens when we add the L1 penalty. With $\hat{\beta}>0$, $|\hat{\beta}| = \hat{\beta}$, so the penalty term is equal to $2\lambda\hat{\beta}$. The derivative of the objective function w.r.t. $\hat{\beta}$ is:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda$
which evidently has solution $\hat{\beta} = (y^Tx - \lambda)/(x^Tx)$.
Obviously by increasing $\lambda$ we can drive $\hat{\beta}$ to zero (at $\lambda = y^Tx$). However, once $\hat{\beta} = 0$, increasing $\lambda$ won't drive it negative, because, writing loosely, the instant $\hat{\beta}$ becomes negative, the derivative of the objective function changes to:
$-2y^Tx +2x^Tx\hat{\beta} - 2\lambda$
where the flip in the sign of the $\lambda$ term is due to the absolute value nature of the penalty; when $\hat{\beta}$ becomes negative, the penalty term becomes equal to $-2\lambda\hat{\beta}$, and taking the derivative w.r.t. $\hat{\beta}$ results in $-2\lambda$. This leads to the solution $\hat{\beta} = (y^Tx + \lambda)/(x^Tx)$, which is obviously inconsistent with $\hat{\beta} < 0$ (given that the least squares solution is $> 0$, which implies $y^Tx > 0$, and $\lambda > 0$). Moving $\hat{\beta}$ from $0$ to a negative value would increase both the L1 penalty AND the squared error term (since we move farther from the least squares solution), so we don't do it; we just stick at $\hat{\beta}=0$.
It should be intuitively clear that the same logic applies, with appropriate sign changes, to a least squares solution with $\hat{\beta} < 0$.
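Putting the two branches together, the derivation above is just soft thresholding, and a few lines of R (illustrative only, and using the $2\lambda|\hat{\beta}|$ convention from above, so the threshold sits at $\lambda = y^Tx$) make the behavior visible:
lasso_beta <- function(x, y, lambda) {
  ytx <- sum(y * x)
  # (y'x - lambda)/(x'x) while that is positive, then stuck at exactly zero
  sign(ytx) * max(abs(ytx) - lambda, 0) / sum(x * x)
}
set.seed(1)
x <- rnorm(50); y <- 0.5 * x + rnorm(50)
sapply(c(0, 5, 10, 20, 50), function(l) lasso_beta(x, y, l))
# once lambda reaches |y'x| the estimate is exactly zero and stays there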
With the squared (L2) penalty $\lambda\hat{\beta}^2$, however, the derivative becomes:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda\hat{\beta}$
which evidently has solution $\hat{\beta} = y^Tx/(x^Tx + \lambda)$. Obviously no finite increase in $\lambda$ will drive this all the way to zero. So the L2 penalty can't act as a variable selection tool without some mild ad-hockery such as "set the parameter estimate equal to zero if its absolute value is less than $\epsilon$".
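The contrast shows up immediately in a numerical sketch (same illustrative setup as above, regenerated here so the snippet stands alone):
ridge_beta <- function(x, y, lambda) sum(y * x) / (sum(x * x) + lambda)   # y'x / (x'x + lambda)
set.seed(1)
x <- rnorm(50); y <- 0.5 * x + rnorm(50)
sapply(c(0, 10, 100, 1e4, 1e6), function(l) ridge_beta(x, y, l))
# shrinks toward zero as lambda grows, but never equals zero exactly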
Obviously things can change when you move to multivariate models, for example, moving one parameter estimate around might force another one to change sign, but the general principle is the same: the L2 penalty function can't get you all the way to zero, because, writing very heuristically, it in effect adds to the "denominator" of the expression for $\hat{\beta}$, but the L1 penalty function can, because it in effect adds to the "numerator".