[Math] Proximal Operator – $ {L}_{1} $ Norm of a Linear Term – Generalization of Soft Threshold Operator

convex optimizationconvex-analysisoptimizationproximal-operators

For certain $\ell_1$-regularized optimization problems, a critical computational step is the soft threshold operator:

$\mathcal{S}_t(x) = \mathrm{sgn}(x)\circ \mathrm{max}(|x|-t)$

where $\circ$ is element-wise multiplication, or the Hadamard product. This operator is the solution to the minimization problem

$\arg\min_z t||z||_1+\frac{1}{2}||z-x||^2_2$

which makes it a kind of proximity operator. What I'd like to know is, is there a closed form solution for the proximity operator of $h(x) = ||xF||_1$ for arbitrary matrix $F$? That is, is there a closed form solution to

$\arg\min_z t||zF||_1 + \frac{1}{2}||z-x||^2_2$

? Thanks.

Best Answer

$\newcommand{\prox}{\operatorname{prox}}$

No, there is no closed form solution unless F is semi-orthogonal or orthogonal. The way to solve it is to notice that this problem is actually a generalized Lasso problem. Recall, that the general form of generalized Lasso is:

$\min$ $\lambda\|Fx\| + (1/2)\|Ax-b\|^2_2$

We can then introduce a new variable $z$ and solve it using ADMM as described in Stephen Boyd, Neal Parikh - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers:

$\min$ $\lambda\|z\| + (1/2)\|x-b\|^2_2$

s.t. $Fx - z = 0$

ADMM is a standard algorithm used for convex optimization; in the general form, it can be written as:

$\min$ $f(x) + g(z)$

s.t. $Ax + Bz = c$

where $x \in R^n$, $z \in R^m$, $A \in R^{p\times n}$, $B \in R^{p \times m}$ and $c \in R^p$. We also make an assumption that $f(x)$ and $g(z)$ are convex. The idea of the procedure is rather standard: write augmented Lagrangian for the convex optimization problem above; minimize Lagrangian w.r.t. $x$ keeping $z$ and $y$ (Lagrangian multiplier) fixed; then minimize it w.r.t $z$ keeping $x$ and $y$ fixed; and finally perform dual update for $y$.

Analytic Solution

Remark
This derivation is extension of dohmatob's solution (Extending details not given in the linked PDF).

Defining:

$$ \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) = \arg \min_{x} \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

This implies:

$$ 0 \in \hat{x} - b + {\lambda}_{1} \partial {\left\| \hat{x} \right\|}_{1} + {\lambda}_{2} \partial {\left\| \hat{x} \right\|}_{2} $$

Where:

$$ u \in \partial {\left\| \cdot \right\|}_{1} \left( \hat{x}_{i} \right) = \begin{cases} \left[-1, 1 \right] & \text{ if } \hat{x}_{i} = 0 \\ \operatorname{sgn}\left( \hat{x}_{i} \right) & \text{ if } \hat{x}_{i} \neq 0 \end{cases} , \; v \in \partial {\left\| \cdot \right\|}_{2} \left( x \right) = \begin{cases} \left\{ z \mid \left\| z \right\|_{2} \leq 1 \right\} & \text{ if } \hat{x} = \boldsymbol{0} \\ \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } & \text{ if } \hat{x} \neq \boldsymbol{0} \end{cases} $$

Notes

The optimization problem tries to minimize $ \hat{x} $ norms while keeping it close to $ b $.
For any element which is not zero in $ \hat{x} $ its sign is identical to the corresponding element in $ b $. Namely $ \forall i \in \left\{ j \mid \hat{x}_{j} \neq 0 \right\}, \, \operatorname{sgn} \left( \hat{x}_{i} \right) = \operatorname{sgn} \left( b \right) $. The reason is simple, If $ \operatorname{sgn} \left( \hat{x}_{i} \right) \neq \operatorname{sgn} \left( b \right) $ then by setting $ \hat{x}_{i} = -\hat{x}_{i} $ one could minimize the distance to $ b $ while keeping the norms the same which is a contradiction to the $ \hat{x} $ being optimal.

Case $ \hat{x} = \boldsymbol{0} $

In this case the above suggests:

$$ b = {\lambda}_{1} u + {\lambda}_{2} v \iff b - {\lambda}_{1} u = {\lambda}_{2} v $$

Since $ {u}_{i} \in \left[ -1, 1 \right] $ and $ \left\| v \right\|_{2} \leq 1 $ one could see that as long as $ \left\| b - {\lambda}_{1} u \right\|_{2} \leq {\lambda}_{2} $ one could set $ \hat{x} = \boldsymbol{0} $ while equality of the constraints hold. Looking for the edge cases (With regard to $ b $) is simple since it cam be done element wise between $ b $ and $ u $. It indeed happens when $ v = \operatorname{sign}\left( b \right) $ which yields:

$$ \hat{x} = \boldsymbol{0} \iff \left\| b - {\lambda}_{1} \operatorname{sign} \left( b \right) \right\|_{2} \leq {\lambda}_{2} \iff \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} $$

Where $ \mathcal{S}_{ \lambda } \left( \cdot \right) $ is the Soft Threshold function with parameter $ \lambda $.

Case $ \hat{x} \neq \boldsymbol{0} $

In this case the above suggests:

$$ \begin{align*} 0 & = \hat{x} - b + {\lambda}_{1} u + {\lambda}_{2} \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } \\ & \iff b - {\lambda}_{1} u = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} \end{align*} $$

For elements where $ {x}_{i} = 0 $ it means $ \left| {b}_{i} \right| \leq {\lambda}_{1} $. Namely $ \forall i \in \left\{ j \mid \hat{x}_{j} = 0 \right\}, \, {b}_{i} - {\lambda}_{1} v = 0 \iff \left| {b}_{i} \right| \leq {\lambda}_{1} $. This comes from the fact $ {v}_{i} \in \left[ -1, 1 \right] $.

This makes the left hand side of the equation to be a Threhsolding Operator, hence:

As written in notes Under the assumption $ \forall i, \, \operatorname{sign} \left( \hat{x}_{i} \right) = \operatorname{sign} \left( {b}_{i} \right) $ the above becomes:

$$ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} $$

Looking on the $ {L}_{2} $ Norm of both equation sides yields:

$$ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \left\| \hat{x} \right\|_{2} \Rightarrow \left\| \hat{x} \right\|_{2} = \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} $$

Plugging this into the above yields:

$$ \hat{x} = \frac{ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) }{ 1 + \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} } } = \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) $$

Remembering that in this case it is guaranteed that $ {\lambda}_{2} < \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} $ hence the term in the braces is positive as needed.

Summary

The solution is given by:

$$ \begin{align*} \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) & = \begin{cases} \boldsymbol{0} & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} \\ \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} > {\lambda}_{2} \end{cases} \\ & = \left( 1 - \frac{ {\lambda}_{2} }{ \max \left\{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2}, {\lambda}_{2} \right\} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \\ & = \operatorname{prox}_{ {\lambda}_{2} {\left\| \cdot \right\|}_{2} } \left( \operatorname{prox}_{ {\lambda}_{1} {\left\| \cdot \right\|}_{1} } \left( b\right)\right) \end{align*} $$

This matches the derivation in the paper On Decomposing the Proximal Map (See Lecture Video - On Decomposing the Proximal Map) mentioned by @littleO.

Solving as an Optimization Problem

This section will illustrate 3 different methods to the above problem (Very similar to Elastic Net Regularization).

Sub Gradient Method

The Sub Gradient of the above is given by:

$$ \begin{cases} x - b + \operatorname{sgn} \left( x \right ) & \text{ if } x = \boldsymbol{0} \\ x - b + \operatorname{sgn} \left( x \right ) + \frac{x}{ {\left\| x \right\|}_{2} } & \text{ if } x \neq \boldsymbol{0} \end{cases} \in \partial \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

Then the Sub Gradient iterations are obvious.

The Split Method

This is based on A Primal Dual Splitting Method for Convex Optimization Involving Lipschitzian, Proximable and Linear Composite Terms.
The used algorithm is 3.2 at page 5 where $ L = I $ Identity Operator and $ F \left( x \right) = \frac{1}{2} \left\| x - b \right\|_{2}^{2} $, $ g \left( x \right) = {\lambda}_{1} \left\| x \right\|_{1} $ and $ h \left( x \right) = {\lambda}_{2} \left\| x \right\|_{2} $.
The Prox Operators are given by the $ {L}_{1} $ and $ {L}_{2} $ Threshold Operators.
One must pay attention to correctly factor the parameters of the Prox as the Moreau’s Identity is used.

The ADMM with 3 Blocks Method

Used the Scaled Form as in Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers Pg. 15.
The ADMM for 3 Blocks is based on Global Convergence of Unmodified 3 Block ADMM for a Class of Convex Minimization Problems.
The split is made by 3 variables which obey $ A x - B y - C z = 0 $ where $ A $ is just the identity matrix repeated twice (Namely replicates the vector - $ A x = \left[ {x}^{T}, {x}^{T} \right]^{T} $. Then using $ B, C $ one could enforce $ x = y = z $ as required.
Each step, since each variable is multiplied by a matrix, is solved using auxiliary algorithm (It is not "Vanilla Prox"). Yet one could extract a Prox function utilizing this specific for of the matrices (Extracting only the relevant part of the vector).

Results

Code

The code is available (Including validation by CVX) at my StackExchange Mathematics Q2595199 GitHub Repository (Look at the Mathematics\Q2595199 folder).

Solution to $ \arg \min_{x} (\frac{1}{2\lambda}||x- y||^2_2+\lambda_0||Bx||_1) $: Least Squares with $ L_1 $ Norm of Linear Term (Proximal Operator)

This would be very long as a comment, so posting here as an answer.

So, let $S := BB^T$. The using the dual representation of $\ell_1$-norm and then Sion's minimax theorem, one computes $$ \begin{split} \frac{1}{2}\|x-a\|^2 + \lambda\|Bx\|_1 &= \frac{1}{2}\|x-a\|^2 + \gamma \sup_{\|z\|_\infty \le 1}z^TBx=\sup_{\|z\|_\infty \le 1}\inf_{x}\frac{1}{2}\|x-a\|_2^2 + \lambda z^TBx\\ &= \sup_{\|z\|_\infty \le 1} z^Ta + \lambda z^TSz, \end{split} $$ which can be effectively solved via a projected gradient algorithm. Note that the gradient of the objective is $a + Sz$, while projection onto the $\ell_\infty$ unit if very easy (reducible to 1D minimization of quadratic function on an interval).