Convex Optimization – Proximal Mapping of Least Squares with L1 and L2 Norm Regularization

Tags: convex-optimization, convex-analysis, normed-spaces, proximal-operators, subgradient

I was trying to solve
$$\min_x \frac{1}{2} \|x - b\|^2_2 + \lambda_1\|x\|_1 + \lambda_2\|x\|_2,$$

where $ b \in \mathbb{R}^n$ is a fixed vector and $\lambda_1,\lambda_2$ are fixed scalars. Let $f(x) = \lambda_1\|x\|_1 + \lambda_2\|x\|_2$; that is, my question is how to find the proximal mapping of $f$, which is formulated as

$$\begin{equation}
\operatorname{prox}_f(b) = \arg\min_x \left\{ \frac{1}{2}\|x - b\|_2^2 + \lambda_1\| x \|_1 + \lambda_2\| x \|_2 \right\}.
\label{eq1}
\end{equation}$$

The proximal mappings of the $l_1$-norm and the $l_2$-norm are each known individually.

For the $l_1$-norm, the soft-thresholding operator was derived in
Derivation of soft threshold operator.
For the $l_2$-norm, the block soft-thresholding operator was derived in
deriving block soft threshold from l2 norm.
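
For reference, the closed forms derived in those two posts are, for a parameter $\lambda \geq 0$ and a point $b$:

$$ \operatorname{prox}_{\lambda \|\cdot\|_1}(b)_i = \operatorname{sgn}(b_i)\max(|b_i| - \lambda, 0), \qquad \operatorname{prox}_{\lambda \|\cdot\|_2}(b) = \left(1 - \frac{\lambda}{\max\{\|b\|_2, \lambda\}}\right) b. $$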

EDIT: I got stuck finding the subgradient of the objective function. I followed the above-mentioned methods to solve my problem. The optimality condition for the original objective reads
$$\begin{equation}
0 \in x - b + \lambda_1 \partial \|x\|_1 + \lambda_2 \partial \|x\|_2.
\label{eq2}
\end{equation}$$

I guess this should be discussed under different conditions:

  • If $x = 0$, then $\partial \|x\|_1 = \{g: g_i \in [-1,1]\}$ and $\partial \|x\|_2 = \{g: \|g\|_2 \leq 1\}$, where $g_i$ denotes the $i$th element of $g$. Thus, I got
    $$
    0\in \lambda_1 \{g: g_i \in [-1,1]\} + \lambda_2 \{g:\|g\|_2 \leq 1 \} - b
    \\ \Leftrightarrow b \in \lambda_1 \{g: g_i \in [-1,1]\} + \lambda_2 \{g:\|g\|_2 \leq 1 \}.
    $$
    This implies that if $\|b\|_2 < \lambda_2$, or if $|b_i| < \lambda_1$ for all $i$, then the optimum is $x = 0$.
  • If $x \neq 0$, then $\partial \|x\|_2 = \{x/\|x\|_2\}$, and the optimality condition is
    $$
    b \in \lambda_1 \partial \|x\|_1 + \lambda_2 \frac {x}{\|x\|_2} + x.
    $$
    If $x_i \neq 0$, then $\partial |x_i| = \{\operatorname{sgn}(x_i)\}$, where $\operatorname{sgn}(x_i)$ is the sign of $x_i$. I guess the cases $x_i = 0$ and $x_i \neq 0$ should also be discussed component by component, but I don't know how to do this: each $x_i$ is coupled to the others through $\|x\|_2$, so the components cannot be treated separately.

I would really appreciate help on solving my problem. Thanks very much.

Best Answer

Analytic Solution

Remark
This derivation is an extension of dohmatob's solution (extending details not given in the linked PDF).

Defining:

$$ \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) = \arg \min_{x} \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

This implies:

$$ 0 \in \hat{x} - b + {\lambda}_{1} \partial {\left\| \hat{x} \right\|}_{1} + {\lambda}_{2} \partial {\left\| \hat{x} \right\|}_{2} $$

Where:

$$ u \in \partial {\left\| \cdot \right\|}_{1} \left( \hat{x} \right) \iff {u}_{i} \in \begin{cases} \left[-1, 1 \right] & \text{ if } \hat{x}_{i} = 0 \\ \left\{ \operatorname{sgn}\left( \hat{x}_{i} \right) \right\} & \text{ if } \hat{x}_{i} \neq 0 \end{cases} , \; v \in \partial {\left\| \cdot \right\|}_{2} \left( \hat{x} \right) = \begin{cases} \left\{ z \mid \left\| z \right\|_{2} \leq 1 \right\} & \text{ if } \hat{x} = \boldsymbol{0} \\ \left\{ \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } \right\} & \text{ if } \hat{x} \neq \boldsymbol{0} \end{cases} $$

Notes

  • The optimization problem tries to minimize the norms of $ \hat{x} $ while keeping it close to $ b $.
  • Every nonzero element of $ \hat{x} $ has the same sign as the corresponding element of $ b $. Namely $ \forall i \in \left\{ j \mid \hat{x}_{j} \neq 0 \right\}, \, \operatorname{sgn} \left( \hat{x}_{i} \right) = \operatorname{sgn} \left( {b}_{i} \right) $. The reason is simple: if $ \operatorname{sgn} \left( \hat{x}_{i} \right) \neq \operatorname{sgn} \left( {b}_{i} \right) $, then setting $ \hat{x}_{i} = -\hat{x}_{i} $ would decrease the distance to $ b $ while keeping both norms the same, contradicting the optimality of $ \hat{x} $.

Case $ \hat{x} = \boldsymbol{0} $

In this case the above suggests:

$$ b = {\lambda}_{1} u + {\lambda}_{2} v \iff b - {\lambda}_{1} u = {\lambda}_{2} v $$

Since $ {u}_{i} \in \left[ -1, 1 \right] $ and $ \left\| v \right\|_{2} \leq 1 $, one can set $ \hat{x} = \boldsymbol{0} $ as long as there is an admissible $ u $ with $ \left\| b - {\lambda}_{1} u \right\|_{2} \leq {\lambda}_{2} $. Finding the best such $ u $ (with regard to $ b $) is simple since it can be done element wise: the minimizing choice is $ {u}_{i} = \operatorname{clip} \left( {b}_{i} / {\lambda}_{1}, \left[ -1, 1 \right] \right) $, which gives $ b - {\lambda}_{1} u = \mathcal{S}_{ {\lambda}_{1} } \left( b \right) $. This yields:

$$ \hat{x} = \boldsymbol{0} \iff \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} $$

Where $ \mathcal{S}_{ \lambda } \left( \cdot \right) $ is the Soft Threshold function with parameter $ \lambda $.

Case $ \hat{x} \neq \boldsymbol{0} $

In this case the above suggests:

$$ \begin{align*} 0 & = \hat{x} - b + {\lambda}_{1} u + {\lambda}_{2} \frac{ \hat{x} }{ \left\| \hat{x} \right\|_{2} } \\ & \iff b - {\lambda}_{1} u = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} \end{align*} $$

For elements where $ \hat{x}_{i} = 0 $ this means $ \left| {b}_{i} \right| \leq {\lambda}_{1} $. Namely, for $ i \in \left\{ j \mid \hat{x}_{j} = 0 \right\} $ the equation reads $ {b}_{i} - {\lambda}_{1} {u}_{i} = 0 $, which has a solution with $ {u}_{i} \in \left[ -1, 1 \right] $ exactly when $ \left| {b}_{i} \right| \leq {\lambda}_{1} $.

This makes the left-hand side of the equation a thresholding operator: as written in the notes, $ \operatorname{sgn} \left( \hat{x}_{i} \right) = \operatorname{sgn} \left( {b}_{i} \right) $ for every nonzero component, and together with $ \left| {b}_{i} \right| \leq {\lambda}_{1} $ on the zero components the above becomes:

$$ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \hat{x} $$

Taking the $ {L}_{2} $ norm of both sides of the equation yields:

$$ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} = \left( 1 + \frac{ {\lambda}_{2} }{ \left\| \hat{x} \right\|_{2} } \right) \left\| \hat{x} \right\|_{2} \Rightarrow \left\| \hat{x} \right\|_{2} = \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} $$

Plugging this into the above yields:

$$ \hat{x} = \frac{ \mathcal{S}_{ {\lambda}_{1} } \left( b \right) }{ 1 + \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} - {\lambda}_{2} } } = \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) $$

Remembering that in this case it is guaranteed that $ {\lambda}_{2} < \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} $, the term in the parentheses is positive, as needed.
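
As a quick numerical check of the formula (with illustrative values chosen here, not taken from the question), let $ b = (3, -1) $ and $ {\lambda}_{1} = {\lambda}_{2} = 1 $. Then

$$ \mathcal{S}_{1} \left( b \right) = (2, 0), \quad \left\| \mathcal{S}_{1} \left( b \right) \right\|_{2} = 2 > {\lambda}_{2}, \quad \hat{x} = \left( 1 - \tfrac{1}{2} \right) (2, 0) = (1, 0). $$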

Summary

The solution is given by:

$$ \begin{align*} \hat{x} = \operatorname{prox}_{{\lambda}_{1} {\left\| \cdot \right\|}_{1} + {\lambda}_{2} {\left\| \cdot \right\|}_{2}} \left( b \right) & = \begin{cases} \boldsymbol{0} & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} \leq {\lambda}_{2} \\ \left( 1 - \frac{ {\lambda}_{2} }{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) & \text{ if } \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2} > {\lambda}_{2} \end{cases} \\ & = \left( 1 - \frac{ {\lambda}_{2} }{ \max \left\{ \left\| \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \right\|_{2}, {\lambda}_{2} \right\} } \right) \mathcal{S}_{ {\lambda}_{1} } \left( b \right) \\ & = \operatorname{prox}_{ {\lambda}_{2} {\left\| \cdot \right\|}_{2} } \left( \operatorname{prox}_{ {\lambda}_{1} {\left\| \cdot \right\|}_{1} } \left( b\right)\right) \end{align*} $$

This matches the derivation in the paper On Decomposing the Proximal Map (See Lecture Video - On Decomposing the Proximal Map) mentioned by @littleO.
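
A minimal NumPy sketch of this closed form (the function names here are my own and are not taken from the repository linked below):

```python
import numpy as np

def soft_threshold(b, lam):
    """Elementwise soft thresholding, i.e. the prox of lam * ||.||_1."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def prox_l1_l2(b, lam1, lam2):
    """Prox of lam1 * ||.||_1 + lam2 * ||.||_2 evaluated at b (closed form above)."""
    s = soft_threshold(b, lam1)
    norm_s = np.linalg.norm(s)
    if norm_s <= lam2:
        return np.zeros_like(s)
    return (1.0 - lam2 / norm_s) * s

# Example matching the worked check above: prox at b = (3, -1) with lam1 = lam2 = 1
print(prox_l1_l2(np.array([3.0, -1.0]), 1.0, 1.0))  # -> [1. 0.]
```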

Solving as an Optimization Problem

This section illustrates three different methods for solving the above problem (which is very similar to Elastic Net regularization).

Sub Gradient Method

A subgradient of the objective is given by:

$$ \begin{cases} x - b + {\lambda}_{1} \operatorname{sgn} \left( x \right ) & \text{ if } x = \boldsymbol{0} \\ x - b + {\lambda}_{1} \operatorname{sgn} \left( x \right ) + {\lambda}_{2} \frac{x}{ {\left\| x \right\|}_{2} } & \text{ if } x \neq \boldsymbol{0} \end{cases} \in \partial \left\{ \frac{1}{2} {\left\| x - b \right\|}_{2}^{2} + {\lambda}_{1} {\left\| x \right\|}_{1} + {\lambda}_{2} {\left\| x \right\|}_{2} \right\} $$

The subgradient iterations then follow directly; a minimal sketch is given below.
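
This sketch uses a diminishing step size $1/k$ and a fixed iteration count, which are my choices and not prescribed by the answer:

```python
import numpy as np

def subgradient_method(b, lam1, lam2, n_iter=5000):
    """Plain subgradient descent on 0.5*||x - b||^2 + lam1*||x||_1 + lam2*||x||_2."""
    x = np.zeros_like(b)
    for k in range(1, n_iter + 1):
        g = x - b + lam1 * np.sign(x)      # subgradient of the smooth term plus lam1*||.||_1
        norm_x = np.linalg.norm(x)
        if norm_x > 0:
            g = g + lam2 * x / norm_x      # subgradient of lam2*||.||_2 away from the origin
        x = x - (1.0 / k) * g              # diminishing step size
    return x

# Example usage with arbitrary data
print(subgradient_method(np.array([3.0, -1.0, 0.2]), 1.0, 1.0))
```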

The Split Method

This is based on A Primal Dual Splitting Method for Convex Optimization Involving Lipschitzian, Proximable and Linear Composite Terms.
The algorithm used is Algorithm 3.2 on page 5, with $ L = I $ (the identity operator), $ F \left( x \right) = \frac{1}{2} \left\| x - b \right\|_{2}^{2} $, $ g \left( x \right) = {\lambda}_{1} \left\| x \right\|_{1} $, and $ h \left( x \right) = {\lambda}_{2} \left\| x \right\|_{2} $.
The prox operators are given by the $ {L}_{1} $ and $ {L}_{2} $ thresholding operators.
One must pay attention to scale the parameters of the prox correctly, since Moreau's identity is used.
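
Under my reading of that scheme (with $ L = I $, the relaxation step dropped, and step sizes $ \tau, \sigma $ chosen here to satisfy the condition $ 1/\tau - \sigma \geq \beta/2 $ with $ \beta = 1 $), a sketch of the iteration looks as follows; by Moreau's identity, the dual prox $ \operatorname{prox}_{\sigma h^{*}} $ reduces to projection onto the $ {L}_{2} $ ball of radius $ {\lambda}_{2} $:

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def project_l2_ball(y, radius):
    """Projection onto {y : ||y||_2 <= radius}; this is the prox of the conjugate of radius*||.||_2."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def primal_dual(b, lam1, lam2, tau=0.5, sigma=1.0, n_iter=1000):
    """Condat-style primal-dual sketch for 0.5*||x - b||^2 + lam1*||x||_1 + lam2*||x||_2, with L = I."""
    x = np.zeros_like(b)
    y = np.zeros_like(b)
    for _ in range(n_iter):
        # Primal step: prox of tau*lam1*||.||_1 applied after a gradient/dual step
        x_new = soft_threshold(x - tau * (x - b) - tau * y, tau * lam1)
        # Dual step: prox of sigma*h^* is projection onto the lam2-ball (Moreau's identity)
        y = project_l2_ball(y + sigma * (2.0 * x_new - x), lam2)
        x = x_new
    return x
```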

The ADMM with 3 Blocks Method

The scaled form is used, as in Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, p. 15.
The ADMM for 3 blocks is based on Global Convergence of Unmodified 3 Block ADMM for a Class of Convex Minimization Problems.
The split is made with 3 variables which obey $ A x - B y - C z = 0 $, where $ A $ is just the identity matrix stacked twice (namely it replicates the vector: $ A x = \left[ {x}^{T}, {x}^{T} \right]^{T} $). Then, using $ B, C $, one can enforce $ x = y = z $ as required.
Since each variable is multiplied by a matrix, each step is solved using an auxiliary algorithm (it is not a "vanilla" prox). Yet one can extract a prox function by utilizing this specific form of the matrices (extracting only the relevant part of the vector).
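
For concreteness, here is a simplified consensus-style sketch with the constraints written directly as $ x = y $ and $ x = z $ (rather than the stacked-matrix form described above); the penalty parameter $ \rho $ and all names are my choices:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= lam else (1.0 - lam / n) * v

def admm_three_block(b, lam1, lam2, rho=1.0, n_iter=500):
    """Scaled-form ADMM on 0.5*||x - b||^2 + lam1*||y||_1 + lam2*||z||_2 s.t. x = y, x = z."""
    y = np.zeros_like(b)
    z = np.zeros_like(b)
    u1 = np.zeros_like(b)   # scaled dual variable for the constraint x = y
    u2 = np.zeros_like(b)   # scaled dual variable for the constraint x = z
    for _ in range(n_iter):
        # x-update: quadratic in x, closed form
        x = (b + rho * (y - u1) + rho * (z - u2)) / (1.0 + 2.0 * rho)
        # y-update: prox of (lam1/rho)*||.||_1
        y = soft_threshold(x + u1, lam1 / rho)
        # z-update: prox of (lam2/rho)*||.||_2
        z = prox_l2(x + u2, lam2 / rho)
        # scaled dual updates
        u1 = u1 + (x - y)
        u2 = u2 + (x - z)
    return x
```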

Results

(Results figure omitted; see the linked code repository.)

Code

The code is available (Including validation by CVX) at my StackExchange Mathematics Q2595199 GitHub Repository (Look at the Mathematics\Q2595199 folder).
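
The repository's validation uses CVX in MATLAB; an analogous check in CVXPY (a sketch with arbitrary test values, not the repository's code) could look like this:

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
n, lam1, lam2 = 5, 0.5, 0.7
b = np.random.randn(n)

# Reference solution from a generic convex solver
x = cp.Variable(n)
objective = 0.5 * cp.sum_squares(x - b) + lam1 * cp.norm1(x) + lam2 * cp.norm(x, 2)
cp.Problem(cp.Minimize(objective)).solve()

# Closed-form prox from the summary above
s = np.sign(b) * np.maximum(np.abs(b) - lam1, 0.0)
norm_s = np.linalg.norm(s)
x_closed = np.zeros(n) if norm_s <= lam2 else (1.0 - lam2 / norm_s) * s

print(np.max(np.abs(x.value - x_closed)))  # should be near zero, up to solver tolerance
```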
