[Math] Show that Huber-loss based optimization is equivalent to $\ell_1$-norm based optimization.

convex-optimization, convex-analysis, machine-learning, optimization

Dear optimization experts,

My apologies for asking about what is probably a well-known relation between Huber-loss based optimization and $\ell_1$-based optimization. However, I am stuck on a 'first-principles' proof (i.e., without invoking the Moreau envelope, as is done, e.g., here) showing that they are equivalent.

Problem formulation

The observation vector is
\begin{align*}
\mathbf{y}
&= \mathbf{A}\mathbf{x} + \mathbf{z} + \boldsymbol{\epsilon} \\
\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} &=
\begin{bmatrix}
\mathbf{a}_1^T\mathbf{x} + z_1 + \epsilon_1 \\
\vdots \\
\mathbf{a}_N^T\mathbf{x} + z_N + \epsilon_N
\end{bmatrix}
\end{align*}
where

  • $\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_N^T \end{bmatrix} \in \mathbb{R}^{N \times M}$ is a known matrix
  • $\mathbf{x} \in \mathbb{R}^{M \times 1}$ is an unknown vector
  • $\mathbf{z} = \begin{bmatrix} z_1 \\ \vdots \\ z_N \end{bmatrix} \in \mathbb{R}^{N \times 1}$ is also unknown but sparse in nature, e.g., it can be seen as an outlier
  • $\boldsymbol{\epsilon} \in \mathbb{R}^{N \times 1}$ is measurement noise, say i.i.d. Gaussian with zero mean and unit variance, i.e., $\epsilon_i \sim \mathcal{N}(0,1)$.

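For concreteness, here is a minimal sketch of a synthetic instance of this observation model (the dimensions $N$, $M$, the number of outliers, and their magnitude are arbitrary illustrative choices, not part of the problem statement):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 50, 5                               # illustrative sizes only
A = rng.standard_normal((N, M))            # known matrix A
x_true = rng.standard_normal(M)            # unknown vector x
z_true = np.zeros(N)                       # sparse outlier vector z
idx = rng.choice(N, size=5, replace=False)
z_true[idx] = 10.0 * rng.standard_normal(5)
eps = rng.standard_normal(N)               # measurement noise ~ N(0, 1)

y = A @ x_true + z_true + eps              # observation vector y
```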
We need to prove that the following two optimization problems P$1$ and P$2$ are equivalent.

P$1$:
\begin{align*}
\text{minimize}_{\mathbf{x},\mathbf{z}} \quad & \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1
\end{align*}

and

P$2$:
\begin{align*}
\text{minimize}_{\mathbf{x}} \quad & \sum_{i=1}^{N} \mathcal{H} \left( y_i - \mathbf{a}_i^T\mathbf{x} \right),
\end{align*}
where the Huber-function $\mathcal{H}(u)$ is given as
$$\mathcal{H}(u) =
\begin{cases}
|u|^2 & |u| \leq \frac{\lambda}{2} \\
\lambda |u| - \frac{\lambda^2}{4} & |u| > \frac{\lambda}{2}
\end{cases} .
$$
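As a numerical sanity check of the claimed equivalence, the Huber function and the two objectives can be coded directly (a minimal sketch; the names `huber`, `P1_obj`, `P2_obj` and the toy data are my own choices for illustration):

```python
import numpy as np

def huber(u, lam):
    """Huber function H(u) as defined above, with breakpoint lam/2."""
    u = np.abs(u)
    return np.where(u <= lam / 2, u**2, lam * u - lam**2 / 4)

def P1_obj(x, z, A, y, lam):
    """Objective of P1: ||y - A x - z||_2^2 + lam * ||z||_1."""
    r = y - A @ x - z
    return np.sum(r**2) + lam * np.sum(np.abs(z))

def P2_obj(x, A, y, lam):
    """Objective of P2: sum_i H(y_i - a_i^T x)."""
    return np.sum(huber(y - A @ x, lam))

# Tiny random instance, used only to evaluate the objectives
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
x = rng.standard_normal(3)
print(P2_obj(x, A, y, lam=1.0))   # should equal the minimum over z of P1_obj (checked below)
```

The claim to be proved is that minimizing `P1_obj` jointly over `(x, z)` and minimizing `P2_obj` over `x` yield the same minimizers in `x`.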


My partial attempt following the suggestion in the answer below

We attempt to convert problem P$1$ into an equivalent form by plugging in the optimal solution for $\mathbf{z}$, i.e.,

\begin{align*}
\text{minimize}_{\mathbf{x},\mathbf{z}} \quad & \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \\
\equiv
\\
\text{minimize}_{\mathbf{x}} \left\{ \text{minimize}_{\mathbf{z}} \right. \quad & \left. \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right\}
\end{align*}

Setting the subdifferential with respect to $\mathbf{z}$ to contain zero,
\begin{align}
0 & \in \frac{\partial}{\partial \mathbf{z}} \left( \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right) \\
\Leftrightarrow \quad & 0 \in -2 \left( \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \right) + \lambda \, \partial \lVert \mathbf{z} \rVert_1 \\
\Leftrightarrow \quad & \left( \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \right) = \frac{\lambda}{2} \mathbf{v}
\end{align}
for some $ \mathbf{v} \in \partial \lVert \mathbf{z} \rVert_1 $, following Ryan Tibshirani's lecture notes (slides 18-20), i.e.,
\begin{align}
v_i \in
\begin{cases}
1 & \text{if } z_i > 0 \\
-1 & \text{if } z_i < 0 \\
[-1,1] & \text{if } z_i = 0 \\
\end{cases}.
\end{align}
Then, the subgradient optimality reads:
\begin{align}
\begin{cases}
\left( y_i - \mathbf{a}_i^T\mathbf{x} - z_i \right) = \frac{\lambda}{2} \, {\rm sign}\left(z_i\right) & \text{if } z_i \neq 0 \\
\left| y_i - \mathbf{a}_i^T\mathbf{x} \right| \leq \frac{\lambda}{2} & \text{if } z_i = 0
\end{cases}
\end{align}
Also, following Ryan Tibshirani's notes, the solution is given by soft thresholding at level $\lambda/2$: $$\mathbf{z} = S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right),$$
where
\begin{align}
S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) =
\begin{cases}
y_i - \mathbf{a}_i^T\mathbf{x} - \frac{\lambda}{2} & \text{if } \left(y_i - \mathbf{a}_i^T\mathbf{x}\right) > \frac{\lambda}{2} \\
0 & \text{if } -\frac{\lambda}{2} \leq \left(y_i - \mathbf{a}_i^T\mathbf{x}\right) \leq \frac{\lambda}{2} \\
y_i - \mathbf{a}_i^T\mathbf{x} + \frac{\lambda}{2} & \text{if } \left( y_i - \mathbf{a}_i^T\mathbf{x}\right) < -\frac{\lambda}{2}
\end{cases} .
\end{align}
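As a quick check of this step (a sketch with arbitrary residual values; `soft_threshold` is my own helper name), the closed-form minimizer with threshold $\lambda/2$ should agree with a brute-force grid minimization of the scalar inner problem $(r - z)^2 + \lambda \lvert z \rvert$:

```python
import numpy as np

def soft_threshold(r, tau):
    """Soft thresholding S_tau(r) = sign(r) * max(|r| - tau, 0)."""
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

lam = 1.5
for r in [-3.0, -0.4, 0.0, 0.2, 0.9, 2.5]:        # arbitrary residual values
    z_grid = np.linspace(-10, 10, 200001)          # dense grid over z
    obj = (r - z_grid)**2 + lam * np.abs(z_grid)   # scalar inner objective
    z_brute = z_grid[np.argmin(obj)]
    z_closed = soft_threshold(r, lam / 2)          # note: threshold lam/2, not lam
    assert abs(z_brute - z_closed) < 1e-3
print("soft thresholding at lam/2 matches the brute-force minimizer")
```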

Now, substituting this back into problem P$1$,
\begin{align*}
& \text{minimize}_{\mathbf{x}} \left\{ \text{minimize}_{\mathbf{z}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right\} \\
\equiv \; & \text{minimize}_{\mathbf{x}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right) \rVert_2^2 + \lambda\lVert S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right) \rVert_1 .
\end{align*}

  • if $\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert \leq \frac{\lambda}{2}$, then $S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = 0$.

In this case, the corresponding terms of the objective become $$\lvert y_i - \mathbf{a}_i^T\mathbf{x} - 0 \rvert^2 + \lambda \cdot 0 = \lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert^2, $$ which matches the quadratic branch $\lvert u \rvert^2$ of the Huber function in this regime. Agree?

  • if $\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert > \frac{\lambda}{2}$, then $S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = y_i - \mathbf{a}_i^T\mathbf{x} \mp \frac{\lambda}{2}$.

In this case, the corresponding terms of the objective become $$\frac{\lambda^2}{4} + \lambda \left\lvert y_i - \mathbf{a}_i^T\mathbf{x} \mp \frac{\lambda}{2} \right\rvert .$$ Since $\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert > \frac{\lambda}{2}$ in this regime, subtracting $\pm\frac{\lambda}{2}$ moves the residual toward zero without changing its sign, so $\left\lvert y_i - \mathbf{a}_i^T\mathbf{x} \mp \frac{\lambda}{2} \right\rvert = \lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert - \frac{\lambda}{2}$, and each term equals $\lambda \lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert - \frac{\lambda^2}{4}$, which is exactly the Huber branch for $\lvert u \rvert > \frac{\lambda}{2}$.
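This case analysis can also be confirmed numerically: plugging $z_i = S_{\lambda/2}(y_i - \mathbf{a}_i^T\mathbf{x})$ back into the $i$-th term of P$1$ reproduces $\mathcal{H}(y_i - \mathbf{a}_i^T\mathbf{x})$ for every residual value. A minimal self-contained sketch (helper names repeated from the earlier snippets):

```python
import numpy as np

def huber(u, lam):
    u = np.abs(u)
    return np.where(u <= lam / 2, u**2, lam * u - lam**2 / 4)

def soft_threshold(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

lam = 2.0
r = np.linspace(-5, 5, 1001)              # residuals y_i - a_i^T x
z = soft_threshold(r, lam / 2)            # optimal z for each residual
inner = (r - z)**2 + lam * np.abs(z)      # i-th term of P1 at the optimal z
assert np.allclose(inner, huber(r, lam))  # equals the Huber term of P2
print("termwise: min_z (r - z)^2 + lam*|z| = H(r)")
```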

Best Answer

The idea is much simpler. Use the fact that $$\min_{\mathbf{x}, \mathbf{z}} f(\mathbf{x}, \mathbf{z}) = \min_{\mathbf{x}} \left\{ \min_{\mathbf{z}} f(\mathbf{x}, \mathbf{z}) \right\}.$$ In your case, the inner minimization over $\mathbf{z}$ separates across coordinates and can be solved in closed form, and its optimal value is exactly the Huber objective of P$2$.
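For instance, a small self-contained check (toy data and helper names are illustrative only): for every candidate $\mathbf{x}$, the inner minimum of P$1$ over $\mathbf{z}$, computed via soft thresholding at level $\lambda/2$, coincides with the Huber objective of P$2$, so the two outer minimizations over $\mathbf{x}$ are the same problem.

```python
import numpy as np

def huber(u, lam):
    u = np.abs(u)
    return np.where(u <= lam / 2, u**2, lam * u - lam**2 / 4)

def soft_threshold(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

rng = np.random.default_rng(0)
N, lam = 30, 1.0
A = rng.standard_normal((N, 1))           # M = 1, so x is a scalar
y = 2.0 * A[:, 0] + rng.standard_normal(N)
y[:3] += 8.0                              # a few gross outliers

for x in np.linspace(-4.0, 4.0, 81):      # sweep candidate values of x
    r = y - A[:, 0] * x                   # residuals
    z = soft_threshold(r, lam / 2)        # inner argmin over z
    p1_inner = np.sum((r - z)**2) + lam * np.sum(np.abs(z))
    p2 = np.sum(huber(r, lam))
    assert np.isclose(p1_inner, p2)
print("min_z P1(x, z) == P2(x) for every x on the grid")
```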
