Gradient of $L(W_1, W_2, W_3) := \sum_{i=1}^N \| W_3 \ g\left(W_2 \ f\left(W_1 x_i \right) \right) - y_i \|_2^2 + \lambda ( \sum_l \| W_l\|_1)$

matrix-calculus, multivariable-calculus, nonlinear-optimization, optimization

Extending this question: how does one obtain the gradient of the ($\ell_1$-penalized) loss
\begin{align}
L(W_1, W_2, W_3) := \sum_{i=1}^N \| W_3 \ g\left(W_2 \ f\left(W_1 x_i \right) \right) - y_i \|_2^2 + \lambda \left( \| W_3\|_1 + \| W_2\|_1 + \| W_1\|_1\right)\ ,
\end{align}

with respect to $W_1$, $W_2$, and $W_3$?

Here $x_i \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$, $W_3 \in \mathbb{R}^{q \times p}$, $y_i \in \mathbb{R}^q$, and $f(z) = g(z) = \frac{1}{1 + \exp(-z)}$ is the logistic sigmoid, applied elementwise.
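Reading $\|W\|_1$ as the entrywise sum of absolute values (one common convention; the function and variable names below are illustrative), the objective can be evaluated numerically as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2, W3, X, Y, lam):
    """L(W1, W2, W3) with f = g = sigmoid.

    X: (N, n) array whose rows are the x_i; Y: (N, q) array whose rows are the y_i.
    """
    total = 0.0
    for x, y in zip(X, Y):
        r = W3 @ sigmoid(W2 @ sigmoid(W1 @ x)) - y
        total += r @ r                                    # squared l2 norm of the residual
    penalty = sum(np.abs(W).sum() for W in (W1, W2, W3))  # entrywise l1 penalty
    return total + lam * penalty
```

This is only a sanity-check implementation of the formula, useful for verifying any gradient derivation by finite differences.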


EDIT:

The gradient of the squared $\ell_2$ term of the cost function is given in the link. But how does one handle the $\ell_1$ regularization term so that the optimal weights can be found?


Thank you so much in advance for your help.

Best Answer

Let $F=F(W_1,W_2,W_3)$ denote the function from your linked answer. Then the new function is simply $$L = F + \lambda\,\Big(\|W_1\|_1 + \|W_2\|_1 + \|W_3\|_1\Big)$$ Consider what happens when you vary $W_1$ while holding $(W_2,W_3)$ constant. $$\eqalign{ dL &= dF + \lambda\,\Big(d\|W_1\|_1 +0+0\Big) \cr &= \bigg(\frac{\partial F}{\partial W_1} + \lambda\,W_1(W_1^TW_1)^{-1/2}\bigg):dW_1 \cr \frac{\partial L}{\partial W_1} &= \frac{\partial F}{\partial W_1} + \lambda\,W_1(W_1^TW_1)^{-1/2} \cr }$$ where the gradient $\frac{\partial F}{\partial W_1}$ is known from the linked answer.
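Note that $W_1(W_1^TW_1)^{-1/2}$ is the gradient of $\|W_1\|_1$ read as the Schatten-$1$ (nuclear) norm $\operatorname{tr}\big((W_1^TW_1)^{1/2}\big)$, valid when $W_1$ has full column rank. A quick finite-difference sanity check of that formula (variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))  # full column rank almost surely

def nuclear_norm(W):
    # ||W||_1 in the Schatten sense: the sum of singular values
    return np.linalg.svd(W, compute_uv=False).sum()

# closed-form gradient from the answer: W (W^T W)^{-1/2}
lam, V = np.linalg.eigh(W.T @ W)
G = W @ (V @ np.diag(lam ** -0.5) @ V.T)

# central finite differences, entry by entry
eps = 1e-6
G_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        G_fd[i, j] = (nuclear_norm(W + E) - nuclear_norm(W - E)) / (2 * eps)

assert np.allclose(G, G_fd, atol=1e-4)
```

If instead $\|\cdot\|_1$ is meant entrywise, the corresponding (sub)gradient is simply $\operatorname{sign}(W_1)$ elementwise, which the scalar case $W(W^TW)^{-1/2} = w/|w|$ of the formula above recovers.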

To calculate the other two gradients, simply repeat this process.
First, by holding $(W_1,W_3)$ constant and varying $W_2$.
Then, by holding $(W_1,W_2)$ constant and varying $W_3$.
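The gradients $\frac{\partial F}{\partial W_l}$ themselves follow from the chain rule applied to $F = \sum_i \|W_3\,g(W_2\,f(W_1 x_i)) - y_i\|_2^2$; a sketch (verified against finite differences, with illustrative names; this reproduces the standard derivation rather than quoting the linked answer verbatim):

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def F_and_grads(W1, W2, W3, X, Y):
    """F = sum_i || W3 g(W2 f(W1 x_i)) - y_i ||^2 with f = g = sigmoid,
    together with dF/dW1, dF/dW2, dF/dW3 accumulated over the data."""
    F = 0.0
    G1, G2, G3 = np.zeros_like(W1), np.zeros_like(W2), np.zeros_like(W3)
    for x, y in zip(X, Y):
        a1 = sig(W1 @ x)                    # f(W1 x)
        a2 = sig(W2 @ a1)                   # g(W2 a1)
        r = W3 @ a2 - y
        F += r @ r
        d3 = 2.0 * r                        # dF/d(W3 a2)
        G3 += np.outer(d3, a2)
        d2 = (W3.T @ d3) * a2 * (1 - a2)    # sigmoid'(z) = s(z)(1 - s(z))
        G2 += np.outer(d2, a1)
        d1 = (W2.T @ d2) * a1 * (1 - a1)
        G1 += np.outer(d1, x)
    return F, (G1, G2, G3)
```

Adding $\lambda$ times the penalty gradient from above to each $G_l$ then gives $\frac{\partial L}{\partial W_l}$.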
