[Math] Understanding the difference between Ridge and LASSO

machine-learning, regression, regularization

I'll start by presenting my current understanding of regularized loss minimization.

In the context of ML, generally we are trying to directly minimize an empirical loss, something of the form $$\underset{h\in\mathcal{H}}{\arg\min}\sum_{i=1}^{m}L\left(y_{i},h\left(x_{i}\right)\right)$$

where there are $m$ samples $x_{1},\ldots,x_{m}$ labeled by $y_{1},\ldots,y_{m}$ and $L$ is some loss function.

In regularized loss minimization we are trying to minimize $$\underset{h\in\mathcal{H}}{\arg\min}\sum_{i=1}^{m}L\left(y_{i},h\left(x_{i}\right)\right)+\lambda\mathcal{R}\left(h\right)$$

where $\mathcal{R}:\mathcal{H}\to\mathbb{R}$ is some regularizer, which penalizes a hypothesis $h$ for its complexity, and $\lambda$ is a parameter controlling the severity of this penalty.

In the context of linear regression, where in general we'd like to minimize $$\underset{\vec{w}\in\mathbb{R}^{d}}{\arg\min}\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}$$

with ridge regularization we add the regularizer $\left\Vert \vec{w}\right\Vert _{2}^{2}$ to obtain $$\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}+\lambda\left\Vert \vec{w}\right\Vert _{2}^{2}$$ And with LASSO, we use $$\left\Vert \vec{y}-X^{T}\vec{w}\right\Vert _{2}^{2}+\lambda\left\Vert \vec{w}\right\Vert _{1}$$

The LASSO solution is supposed to result in a sparse solution vector $\vec{w}$, rather than one with small entries, as is expected with Ridge.
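To see this behaviour numerically, here is a minimal NumPy sketch (my own illustration, not from any reference): Ridge via its closed form, LASSO via coordinate descent, where each coordinate update is a soft-thresholding step. The synthetic data, the value of $\lambda$, and all variable names are made up for illustration; the code uses the convention $X\in\mathbb{R}^{n\times d}$ with residual $y - Xw$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features matter.
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -3.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 10.0

# Ridge: closed form for  ||y - Xw||^2 + lam * ||w||_2^2.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# LASSO: coordinate descent on  ||y - Xw||^2 + lam * ||w||_1.
# Each coordinate minimization has a closed form: soft-thresholding at lam/2.
w_lasso = np.zeros(d)
for _ in range(200):
    for j in range(d):
        r = y - X @ w_lasso + X[:, j] * w_lasso[j]  # residual excluding feature j
        rho = X[:, j] @ r
        z = X[:, j] @ X[:, j]
        w_lasso[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z

print("ridge:", np.round(w_ridge, 3))  # all entries shrunk but nonzero
print("lasso:", np.round(w_lasso, 3))  # irrelevant entries driven exactly to zero
```

With this setup the Ridge weights on the three irrelevant features are small but nonzero, while the soft-thresholding step in the LASSO updates sets them to exactly zero.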

As an explanation for this last statement, I've seen the following figure (from *The Elements of Statistical Learning*):

[Figure: elliptical contours of the least-squares loss together with the $\ell_1$ (diamond) and $\ell_2$ (circular) constraint regions]

The ellipses are the contours of the least squares loss function and the blue shapes are the respective norm balls. The explanation (from my lecture notes) is:

“Since the $\ell_{1}$ ball has corners on the main axes, it is “more
likely” that the Lasso solution will occur on a sparse vector”

So my questions are

  1. Why is solving these problems equivalent to finding a point on the smallest possible contour that intersects the respective norm ball?

  2. Why is it more likely that this will happen on a “corner” for the $\ell_{1}$ norm?

Best Answer

I think it's clearer to first consider the optimization problem \begin{align} \text{minimize} &\quad \| w \|_1 \\ \text{subject to } &\quad \| y - X^T w \| \leq \delta. \end{align} The red curves in your picture are level curves of the function $\|y-X^T w\|$, and the outermost red curve is $\|y-X^Tw\|=\delta$. We imagine shrinking the $\ell_1$-norm ball (shown in blue) until it is as small as possible while still intersecting the constraint set. You can see that in the 2D case it appears very likely that the point of intersection will be one of the corners of the $\ell_1$-norm ball, since the corners protrude along the coordinate axes. So the solution to this optimization problem is likely to be sparse.

We can use Lagrange multipliers to replace the constraint $\|y- X^Tw\| \leq \delta$ with a penalty term in the objective function. There exists a Lagrange multiplier $\lambda \geq 0$ such that any minimizer of the problem I wrote above is also a minimizer of $$ \text{minimize} \quad \lambda (\| y-X^Tw\|-\delta) + \|w\|_1. $$ The constant $-\lambda\delta$ does not affect the minimizer, so after dropping it and dividing through by $\lambda$, this is (up to reparametrizing the penalty weight) a standard Lasso problem. So, solutions to a Lasso problem are likely to be sparse.
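The "corner" intuition can also be checked in the one-feature case, where both penalized problems have closed forms obtained by setting the (sub)derivative of $\sum_i(y_i - x_i w)^2 + \text{penalty}$ to zero. This sketch (my own, with made-up data) shows Ridge merely shrinking a weak coefficient while LASSO snaps it exactly to zero:

```python
import numpy as np

# One-feature regression: minimize  sum_i (y_i - x_i * w)^2 + penalty(w).
# Closed forms (set the derivative / subderivative to zero):
#   ridge: w = <x,y> / (<x,x> + lam)
#   lasso: w = sign(<x,y>) * max(|<x,y>| - lam/2, 0) / <x,x>
x = np.array([1.0, -1.0, 2.0, -2.0, 0.5])
y = np.array([0.1, 0.0, -0.2, 0.1, 0.05])  # only weakly correlated with x
lam = 2.0

xy, xx = float(x @ y), float(x @ x)        # xy = -0.475, xx = 10.25
w_ridge = xy / (xx + lam)
w_lasso = np.sign(xy) * max(abs(xy) - lam / 2, 0.0) / xx

print("ridge:", w_ridge)   # about -0.039: shrunk, but nonzero
print("lasso:", w_lasso)   # exactly zero: |<x,y>| is below the threshold lam/2
```

Because $|\langle x,y\rangle| = 0.475 < \lambda/2 = 1$, the soft threshold kills the coefficient entirely, which is the 1D analogue of the solution landing on a corner of the $\ell_1$ ball.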