[Math] Optimization of Convex Function with $L_\infty$ Norm Regularization (Composition Model)

convex optimization, gradient descent, numerical optimization, optimization, proximal-operators

I'm trying to solve an optimization problem of the form

$$\text{minimize } \; f(x) + \|x\|_\infty$$

where $x$ ranges over all of $\mathbb{R}^n$ and $f:\mathbb{R}^n \to \mathbb{R}$ is a nice, smooth, convex, differentiable function. Notice the $L_\infty$ penalty term — this is some kind of $L_\infty$-regularized convex optimization problem.

What techniques are available for solving this? Is there any theory to guide us when facing this kind of optimization problem?

I've tried using (sub)gradient descent based methods on the function $g(x) = f(x) + \|x\|_\infty$, but this works poorly. Is there a better method that will be more effective?

What goes wrong with gradient descent. Let me try to illustrate what goes wrong with an example. After a little while, we reach a value of $x$ that looks something like $x=(2.000, 1.999)$ (picking an example, and working in 2 dimensions, for ease of exposition), where the gradient suggests decreasing $x_1$ and increasing $x_2$ (i.e., $\frac{\partial g}{\partial x_1}(x) > 0$ and $\frac{\partial g}{\partial x_2}(x) < 0$). On the next step we might move to the point $x=(1.999, 2.000)$. At this point the gradient might point in the opposite direction (i.e., $\frac{\partial g}{\partial x_1}(x) < 0$ and $\frac{\partial g}{\partial x_2}(x) > 0$), so we move back to $x=(2.000, 1.999)$, basically oscillating around the diagonal line $x_1=x_2$.

In short, it gets near the diagonal line, then zig-zags back and forth across it. What seems to be going wrong here is that gradient descent uses the gradient to extrapolate the improvement from moving in a particular direction; in actuality, however, the gradient of $\|x\|_\infty$ has a sharp discontinuity at the diagonal line, so it's a bad idea to use the value of the gradient on one side of the line to extrapolate the improvement from continuing to move across that discontinuity.

I gave an example in two dimensions. In $n$ dimensions, what seems to happen is that it's very likely there exists some pair of dimensions where this happens (i.e., say $x_i$ is the largest component of $x$; then often there exists an index $j$ such that projecting $x$ onto the two dimensions $i,j$ behaves as above).
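
To make this concrete, here is a small sketch of the oscillation, with a hypothetical $f(x) = \tfrac12\|x-(3,3)\|_2^2$ standing in for my real objective (its regularized minimizer is $(2.5, 2.5)$, so the iterates should settle on the diagonal but instead bounce across it):

```python
# A minimal sketch of the oscillation, assuming a hypothetical smooth part
# f(x) = 0.5 * ||x - (3, 3)||_2^2, whose regularized minimizer is (2.5, 2.5).
import numpy as np

a = np.array([3.0, 3.0])

def grad_f(x):
    return x - a                       # gradient of 0.5 * ||x - a||^2

def subgrad_inf(x):
    g = np.zeros_like(x)
    i = np.argmax(np.abs(x))           # a subgradient of ||x||_inf picks out
    g[i] = np.sign(x[i])               # the largest-magnitude coordinate
    return g

x = np.array([2.52, 2.48])
step = 0.1
for k in range(20):
    x = x - step * (grad_f(x) + subgrad_inf(x))
    print(k, x)                        # iterates hop back and forth across x1 = x2
```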

What research I've done so far. I've been reading about a bunch of methods (penalty methods, barrier methods, proximal methods, etc.) but haven't found discussion of optimization problems that include the $L_\infty$ norm as one term. I don't see how to apply the penalty or barrier methods, since there is no constant $c$ such that we want to force $x_i \le c$ for all $i$. Trying to use a penalty term like $\sum_i (x_i-\|x\|_\infty)^2$ seems unlikely to help, as it still has the sharp discontinuity in its gradient. Using a barrier penalty like $-\sum_i \log (\|x\|_\infty - x_i)$ seems obviously broken, as at least one term in the sum will always be $\infty$ (for the index attaining the max, the argument of the log is zero).

I noticed we can express the optimization problem equivalently as
$$\begin{align*}
\text{minimize } \; &f(x) + t\\
\text{subject to } \;&(x,t) \in \mathcal{C}
\end{align*}$$
where $\mathcal{C} = \{(x,t) : |x_i| \le t \text{ for $i=1,2,\dots,n$}\}$… but I couldn't see how to use this to help. Applying a penalty function or barrier method to this, to turn it into an unconstrained optimization problem (e.g., by penalizing solutions with $|x_i| > t$), seems likely to run back into the same problem, as when $(x,t) \in \mathcal{C}$ the partial derivative with respect to $x_i$ is uninformative about what happens if a single iteration takes us outside $\mathcal{C}$.
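
One thing I can do with this form is hand it to a generic constrained solver (though that doesn't feel like it addresses the underlying issue). A rough sketch, with the same hypothetical quadratic $f(x) = \tfrac12\|x-(3,3)\|_2^2$ as above standing in for my real $f$, and the constraint written as the $2n$ smooth linear inequalities $-t \le x_i \le t$:

```python
# A rough sketch of the epigraph reformulation handed to an off-the-shelf
# constrained solver. f is a hypothetical quadratic stand-in; the variable
# vector is packed as z = (x_1, ..., x_n, t).
import numpy as np
from scipy.optimize import minimize

a = np.array([3.0, 3.0])              # hypothetical data defining f
n = a.size

def f(x):
    return 0.5 * np.sum((x - a) ** 2)

def objective(z):                     # f(x) + t
    x, t = z[:n], z[n]
    return f(x) + t

# |x_i| <= t, written as 2n smooth linear inequalities g(z) >= 0
constraints = [{'type': 'ineq', 'fun': (lambda z, i=i, s=s: z[n] - s * z[i])}
               for i in range(n) for s in (+1.0, -1.0)]

z0 = np.zeros(n + 1)
res = minimize(objective, z0, method='SLSQP', constraints=constraints)
print(res.x[:n], res.x[n])            # x* and t* = ||x*||_inf at the solution
```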

Maybe there is some simple, smooth approximation $h(x)$ that we can replace $\|x\|_\infty$ with, and then minimize $f(x)+h(x)$? Maybe $h(x)=\|x\|_\infty + \frac{1}{\alpha} \|x\|_2^2$ for some large value of $\alpha>0$, or something like that? Or maybe using a softplus function, e.g., something like $h(x) = \|x\|_\infty + \sum_i S(x_i - \|x\|_\infty)$ where $S(v)=\log(1+e^v)$ is the softplus function (a smooth approximation to $\max(0,v)$)? This sounds a bit ad hoc; are there any theoretical results or practical guidelines?

Best Answer

One easy and effective option is to solve this problem using the proximal gradient method or FISTA (which is an accelerated version of the proximal gradient method).

The proximal gradient method minimizes $f(x) + g(x)$ where $f$ and $g$ are convex, $f$ is smooth, and $g$ is "simple", in the sense that the proximal operator of $g$ can be evaluated efficiently. (We also require that $f$ and $g$ be lower semi-continuous, but that is a mild assumption that is usually satisfied in practice.) For large-scale problems of this form, FISTA is often state of the art.
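
For concreteness, here is a minimal sketch of the plain (unaccelerated) proximal gradient iteration; it assumes a step size no larger than $1/L$, where $L$ is a Lipschitz constant of $\nabla f$, and a function `prox_g(v, s)` returning $\operatorname{prox}_{sg}(v)$ (one for the $\ell_\infty$-norm is sketched below).

```python
# A minimal sketch of the unaccelerated proximal gradient iteration
#   x_{k+1} = prox_{s*g}(x_k - s * grad_f(x_k)).
# Assumptions: step s <= 1/L with L a Lipschitz constant of grad_f, and
# prox_g(v, s) returns prox_{s*g}(v); a prox for the l_inf norm is given below.
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_g(x - step * grad_f(x), step)   # gradient step, then prox step
    return x
```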

You can just take $g(x) = \|x\|_{\infty}$. The prox-operator of the $\ell_\infty$-norm is well known, and can be found for example in Vandenberghe's 236c notes. I discussed the prox-operator of the $\ell_\infty$-norm in a bit more detail here.
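
As a sketch, that prox can be computed from the Moreau decomposition $\operatorname{prox}_{s\|\cdot\|_\infty}(v) = v - P_{\{\|w\|_1 \le s\}}(v)$, i.e. $v$ minus its projection onto the $\ell_1$-ball of radius $s$, with the projection computed by the usual sorting method (Duchi et al., 2008):

```python
# A sketch of prox_{s*||.||_inf}, via the Moreau decomposition
#   prox_{s*||.||_inf}(v) = v - P_{||w||_1 <= s}(v),
# using the sorting-based l1-ball projection (Duchi et al., 2008).
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius}, radius > 0."""
    if np.sum(np.abs(v)) <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]           # magnitudes, sorted descending
    cssv = np.cumsum(u) - radius
    ind = np.arange(1, v.size + 1)
    rho = ind[u - cssv / ind > 0][-1]      # number of active coordinates
    theta = cssv[rho - 1] / rho            # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, s):
    """prox of s * ||.||_inf at v (s > 0)."""
    return v - project_l1_ball(v, s)
```

Plugging `prox_linf` into the `proximal_gradient` sketch above (together with $\nabla f$ and a step size of at most $1/L$) then handles the original problem; FISTA uses exactly the same two ingredients plus an extrapolation step.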
