[Math] Understanding the difference between Ridge and LASSO

machine-learning, regression, regularization

I'll start by presenting my current understanding of regularized loss minimization.

In the context of ML, generally we are trying to directly minimize an empirical loss, something of the form $$\underset{h\in\mathcal{H}}{\arg\min}\sum_{i=1}^{m}L\left(y_{i},h\left(x_{i}\right)\right)$$

where there are $m$ samples $x_{1},\ldots,x_{m}$ labeled by $y_{1},\ldots,y_{m}$ and $L$ is some loss function.

In regularized loss minimization we are trying to minimize $$\underset{h\in\mathcal{H}}{\arg\min}\sum_{i=1}^{m}L\left(y_{i},h\left(x_{i}\right)\right)+\lambda\mathcal{R}\left(h\right)$$

where $\mathcal{R}:\mathcal{H}\to\mathbb{R}$ is some regularizer, which penalizes a hypothesis $h$ for its complexity, and $\lambda$ is a parameter controlling the severity of this penalty.

In the context of linear regression, where in general we'd like to minimize $$\underset{\vec{w}\in\mathbb{R}^{d}}{\arg\min}\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}$$

with ridge regularization we add the regularizer $\left\Vert \vec{w}\right\Vert _{2}^{2}$ to obtain $$\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}+\lambda\left\Vert \vec{w}\right\Vert _{2}^{2}$$ And with LASSO, we use $$\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}+\lambda\left\Vert \vec{w}\right\Vert _{1}$$

The LASSO solution is supposed to result in a sparse solution vector $\vec{w}$, rather than one with small entries, as is expected with Ridge.
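To see this behaviour numerically, here is a minimal NumPy sketch (my own illustration, not from any reference): Ridge via its closed form, LASSO via coordinate descent, where each coordinate update is a soft-thresholding step. The synthetic data, the value of $\lambda$, and all variable names are made up for illustration; the code uses the convention $X\in\mathbb{R}^{n\times d}$ with residual $y - Xw$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features matter.
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -3.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 10.0

# Ridge: closed form for  ||y - Xw||^2 + lam * ||w||_2^2.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# LASSO: coordinate descent on  ||y - Xw||^2 + lam * ||w||_1.
# Each coordinate minimization has a closed form: soft-thresholding at lam/2.
w_lasso = np.zeros(d)
for _ in range(200):
    for j in range(d):
        r = y - X @ w_lasso + X[:, j] * w_lasso[j]  # residual excluding feature j
        rho = X[:, j] @ r
        z = X[:, j] @ X[:, j]
        w_lasso[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z

print("ridge:", np.round(w_ridge, 3))  # all entries shrunk but nonzero
print("lasso:", np.round(w_lasso, 3))  # irrelevant entries driven exactly to zero
```

With this setup the Ridge weights on the three irrelevant features are small but nonzero, while the soft-thresholding step in the LASSO updates sets them to exactly zero.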

As an explanation for this last statement, I've seen the following figure (from *The Elements of Statistical Learning*):

[Figure: elliptical contours of the least-squares loss together with the $\ell_1$ (diamond) and $\ell_2$ (circular) constraint regions]

The ellipses are the contours of the least squares loss function and the blue shapes are the respective norm balls. The explanation (from my lecture notes) is:

“Since the $\ell_{1}$ ball has corners on the main axes, it is “more
likely” that the Lasso solution will occur on a sparse vector”

So my questions are

  1. Why is solving these problems equivalent to finding a point on the smallest possible contour that intersects the respective norm ball?

  2. Why is it more likely that this will happen on a “corner” for the $\ell_{1}$ norm?

Best Answer

I think it's clearer to first consider the optimization problem \begin{align} \text{minimize} &\quad \| w \|_1 \\ \text{subject to } &\quad \| y - X^T w \| \leq \delta. \end{align} The red curves in your picture are level curves of the function $\|y-X^T w\|$, and the outermost red curve is $\|y-X^Tw\|=\delta$. We imagine shrinking the $\ell_1$-norm ball (shown in blue) until it is as small as possible while still intersecting the constraint set. You can see that in the 2D case it appears very likely that the point of intersection will be one of the corners of the $\ell_1$-norm ball, since the corners protrude along the coordinate axes. So the solution to this optimization problem is likely to be sparse.

We can use Lagrange multipliers to replace the constraint $\|y- X^Tw\| \leq \delta$ with a penalty term in the objective function. There exists a Lagrange multiplier $\lambda \geq 0$ such that any minimizer of the problem I wrote above is also a minimizer of $$ \text{minimize} \quad \lambda (\| y-X^Tw\|-\delta) + \|w\|_1. $$ The constant $-\lambda\delta$ does not affect the minimizer, so after dropping it and dividing through by $\lambda$, this is (up to reparametrizing the penalty weight) a standard Lasso problem. So, solutions to a Lasso problem are likely to be sparse.
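The "corner" intuition can also be checked in the one-feature case, where both penalized problems have closed forms obtained by setting the (sub)derivative of $\sum_i(y_i - x_i w)^2 + \text{penalty}$ to zero. This sketch (my own, with made-up data) shows Ridge merely shrinking a weak coefficient while LASSO snaps it exactly to zero:

```python
import numpy as np

# One-feature regression: minimize  sum_i (y_i - x_i * w)^2 + penalty(w).
# Closed forms (set the derivative / subderivative to zero):
#   ridge: w = <x,y> / (<x,x> + lam)
#   lasso: w = sign(<x,y>) * max(|<x,y>| - lam/2, 0) / <x,x>
x = np.array([1.0, -1.0, 2.0, -2.0, 0.5])
y = np.array([0.1, 0.0, -0.2, 0.1, 0.05])  # only weakly correlated with x
lam = 2.0

xy, xx = float(x @ y), float(x @ x)        # xy = -0.475, xx = 10.25
w_ridge = xy / (xx + lam)
w_lasso = np.sign(xy) * max(abs(xy) - lam / 2, 0.0) / xx

print("ridge:", w_ridge)   # about -0.039: shrunk, but nonzero
print("lasso:", w_lasso)   # exactly zero: |<x,y>| is below the threshold lam/2
```

Because $|\langle x,y\rangle| = 0.475 < \lambda/2 = 1$, the soft threshold kills the coefficient entirely, which is the 1D analogue of the solution landing on a corner of the $\ell_1$ ball.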