How does this L1 regularization derivation follow? (Proof that it makes models sparse)

Tags: derivatives, linear algebra, machine learning, normed-spaces, regularization

I'm reading the "Deep Learning" book (Goodfellow et al., 2016), and on pages 231-232 (you can check them here) they give a rather unusual proof of how L1 regularization makes models sparse. You can skip to the last two expressions for the actual question, but here is some context if you want it:

The regularized objective function $\tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$ is given by:
$$ \tilde{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=\alpha\|\boldsymbol{w}\|_{1}+J(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})$$
where $\boldsymbol{w}$ is the parameter vector and $\boldsymbol{X}, \boldsymbol{y}$ are the design matrix (inputs) and the outputs of the model, respectively.
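
For concreteness, this is how I picture the two terms in code (the squared-error linear model for $J$ is my own assumption; the book only needs a generic $J$):

```python
import numpy as np

def J(w, X, y):
    # Unregularized cost J(w; X, y); a squared-error linear model is assumed
    # here purely for concreteness -- the book only requires a generic J.
    return 0.5 * np.sum((X @ w - y) ** 2)

def J_tilde(w, X, y, alpha):
    # L1-regularized objective: alpha * ||w||_1 + J(w; X, y)
    return alpha * np.sum(np.abs(w)) + J(w, X, y)
```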

If we make a second-order Taylor series approximation of the unregularized loss function of our model (which we assume to be a linear model, to ensure it has a clean analytical solution), we have
$$\hat{J}(\boldsymbol{w})=J\left(\boldsymbol{w}^{*}\right)+\frac{1}{2}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)^{\top} \boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$
where $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$ evaluated at $\boldsymbol{w}^{*}$; there is no first-order term in this quadratic approximation because $\boldsymbol{w}^{*}$ is defined to be a minimum.

The minimum of $\hat{J}$ occurs where its gradient
$$\nabla_{\boldsymbol{w}} \hat{J}(\boldsymbol{w})=\boldsymbol{H}\left(\boldsymbol{w}-\boldsymbol{w}^{*}\right)$$
is equal to $\mathbf{0}$.

If we make the further assumption that the Hessian is diagonal, $\boldsymbol{H}=\operatorname{diag}\left(\left[H_{1,1}, \ldots, H_{n, n}\right]\right)$, where each $H_{i, i}>0$ (more details in the book), we have that our quadratic approximation of the L1-regularized objective function decomposes into a sum over the parameters:
$$\hat{J}(\boldsymbol{w} ; \boldsymbol{X}, \boldsymbol{y})=J\left(\boldsymbol{w}^{*} ; \boldsymbol{X}, \boldsymbol{y}\right)+\sum_{i}\left[\frac{1}{2} H_{i, i}\left(\boldsymbol{w}_{i}-\boldsymbol{w}_{i}^{*}\right)^{2}+\alpha\left|w_{i}\right|\right]$$
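
To see that this decomposition really lets each parameter be treated on its own, here is a small numeric sketch (the diagonal Hessian entries, $\boldsymbol{w}^{*}$ and $\alpha$ below are made-up illustrative values, not from the book): since the sum has no cross terms, each $w_i$ can be minimized independently.

```python
import numpy as np

# Made-up illustrative values: diagonal Hessian with positive entries.
H_diag = np.array([2.0, 0.5, 1.0])
w_star = np.array([1.5, -1.0, 0.05])
alpha = 0.3

def per_coordinate_cost(w_i, i):
    # One term of the sum: 1/2 * H_ii * (w_i - w*_i)^2 + alpha * |w_i|
    return 0.5 * H_diag[i] * (w_i - w_star[i]) ** 2 + alpha * np.abs(w_i)

# No cross terms, so minimizing the sum is the same as minimizing each
# term separately (done here by a crude grid search).
grid = np.linspace(-3.0, 3.0, 60001)
w_min = np.array([grid[np.argmin(per_coordinate_cost(grid, i))] for i in range(3)])
print(w_min)  # ~[1.35, -0.4, 0.0]: the third coordinate is pushed to zero
```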

Then they say (and this is the step I don't see, specifically where that max function comes from):

The problem of minimizing this approximate cost function has an analytical solution (for each dimension $i$), with the following form:
$$w_{i}=\operatorname{sign}\left(w_{i}^{*}\right) \max \left\{\left|w_{i}^{*}\right|-\frac{\alpha}{H_{i, i}}, 0\right\}$$

How did they get that expression? I've reached a similar one, but without that max function…

For further insights, please refer to the book.

Best Answer

Below is a derivation using the method described in this video.

As mentioned by @angryavian, we want to minimize $g(w_i):=\frac{1}{2}H_{ii}(w_i-w^*_i)^2+\alpha|w_i|$

We achieve this by taking the (sub)derivative of $g(w_i)$ with respect to $w_i$ for each dimension $i$ and setting it to $0$:

${g^\prime(w_i)} = H_{ii}(w_i - w^*_i) + \alpha \cdot \partial f(w_i)$

where $f(x)=|x|$ and $\partial f$ is the subgradient (or subderivative since $x$ in this case is a scalar) of $f$ and therefore $\partial f(x)=\operatorname{sign}(x)$ where

\begin{equation} \operatorname{sign}(x)=\begin{cases} 1 & \text{if $x>0$}\\ [-1, 1] & \text{if $x=0$}\\ -1 & \text{if $x<0$}\\ \end{cases} \end{equation}
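
In code, this set-valued sign (the subdifferential of $|x|$) could be sketched as follows; the helper below is my own illustration, not something from the book or the video:

```python
def abs_subdifferential(x):
    # Subdifferential of f(x) = |x| at x, returned as a closed interval (lo, hi).
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)  # at x = 0 the subgradient is any value in [-1, 1]
```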

Therefore setting the derivative to zero results in:

${g^\prime(w_i)} = H_{ii}(w_i - w^*_i) + \alpha \cdot \operatorname{sign}(w_i)=0$

$\Rightarrow w_i=w^*_i - \frac{\alpha}{H_{ii}} \cdot \operatorname{sign}(w_i)$

$\Rightarrow w_i=\begin{cases} w^*_i-\frac{\alpha}{H_{ii}} & \text{if $w_i>0$}\\ w^*_i-\frac{\alpha}{H_{ii}}[-1, 1] & \text{if $w_i=0$}\\ w^*_i+\frac{\alpha}{H_{ii}} & \text{if $w_i<0$}\\ \end{cases}$

$w_i>0 \Rightarrow w^*_i-\frac{\alpha}{H_{ii}}>0 \Rightarrow w^*_i > \frac{\alpha}{H_{ii}}$

$w_i=0 \Rightarrow 0 \in w^*_i-\frac{\alpha}{H_{ii}}[-1, 1] \Rightarrow |w^*_i| \le \frac{\alpha}{H_{ii}}$

$w_i<0 \Rightarrow w^*_i+\frac{\alpha}{H_{ii}}<0 \Rightarrow w^*_i < -\frac{\alpha}{H_{ii}}$

Therefore we can rewrite the minimizing $w_i$ as

$\Rightarrow w_i=\begin{cases} w^*_i-\frac{\alpha}{H_{ii}} & \text{if $w^*_i > \frac{\alpha}{H_{ii}}$}\\ 0 & \text{if $|w^*_i| \le \frac{\alpha}{H_{ii}}$}\\ w^*_i+\frac{\alpha}{H_{ii}} & \text{if $w^*_i < -\frac{\alpha}{H_{ii}}$}\\ \end{cases}$

From the above we can see that $w_i$ has taken the form of a soft-thresholding operator, which can be rewritten as:

$w_i=\operatorname{sign}^\prime(w^*_i) \cdot \max(|w^*_i|-\frac{\alpha}{H_{ii}},0)$ where

\begin{equation} \operatorname{sign}^\prime(x)=\begin{cases} 1 & \text{if $x>0$}\\ 0 & \text{if $x=0$}\\ -1 & \text{if $x<0$}\\ \end{cases} \end{equation}
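
Translated directly into code, the piecewise form and the soft-thresholding form look like this (a sketch; the function names are mine):

```python
import numpy as np

def w_piecewise(w_star_i, alpha, H_ii):
    # Case analysis derived above.
    t = alpha / H_ii
    if w_star_i > t:
        return w_star_i - t
    if w_star_i < -t:
        return w_star_i + t
    return 0.0  # |w*_i| <= alpha / H_ii

def w_soft_threshold(w_star_i, alpha, H_ii):
    # sign'(w*_i) * max(|w*_i| - alpha/H_ii, 0); np.sign(0) = 0 matches sign'.
    return np.sign(w_star_i) * max(abs(w_star_i) - alpha / H_ii, 0.0)
```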

You can verify that the two forms of $w_i$ are equivalent, since $H_{ii}$ is positive (mentioned in the book) and $\alpha$ (the regularization hyperparameter) is greater than or equal to zero. For example, consider the first case, $w^*_i > \frac{\alpha}{H_{ii}}$ (so in particular $w^*_i > 0$):

$w^*_i > \frac{\alpha}{H_{ii}}\Rightarrow w^*_i - \frac{\alpha}{H_{ii}} > 0$

therefore

$\operatorname{sign}^\prime(w^*_i) \cdot \max(|w^*_i|-\frac{\alpha}{H_{ii}},0) = \max(w^*_i-\frac{\alpha}{H_{ii}},0) = w^*_i-\frac{\alpha}{H_{ii}}$

Similarly, other cases can be verified.
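
As an extra sanity check, the closed form can also be compared against a brute-force minimization of $g(w_i)$ for a few made-up values of $w^*_i$, $H_{ii}$ and $\alpha$ (a sketch using a dense grid rather than an off-the-shelf solver):

```python
import numpy as np

def g(w, w_star_i, alpha, H_ii):
    # Per-coordinate objective: 1/2 * H_ii * (w - w*_i)^2 + alpha * |w|
    return 0.5 * H_ii * (w - w_star_i) ** 2 + alpha * np.abs(w)

def soft_threshold(w_star_i, alpha, H_ii):
    # Closed-form minimizer: sign(w*_i) * max(|w*_i| - alpha/H_ii, 0)
    return np.sign(w_star_i) * max(abs(w_star_i) - alpha / H_ii, 0.0)

grid = np.linspace(-5.0, 5.0, 200001)
for w_star_i, H_ii, alpha in [(2.0, 1.0, 0.5), (-0.3, 2.0, 1.0), (0.1, 0.5, 0.2)]:
    w_grid = grid[np.argmin(g(grid, w_star_i, alpha, H_ii))]
    w_closed = soft_threshold(w_star_i, alpha, H_ii)
    assert abs(w_grid - w_closed) < 1e-3, (w_grid, w_closed)
print("closed form matches the grid minimizer")
```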
