Least Squares Problem with Non-Negativity Constraints – Optimization

convex optimization, least squares, optimization, quadratic programming

Let $\mathbf{x}=[x_1,\ldots,x_K]$. I have the following optimization problem:

\begin{array}{rl}
\min \limits_{\mathbf{x}} & \| \mathbf{Ax}-\mathbf{b} \|^2 \\
\mbox{s.t.} & x_k\ge 0, \forall k
\end{array}

How can I solve this problem?

Another thing: my main problem was
\begin{array}{rl}
\min \limits_{\mathbf{x}} & \| \mathbf{A'x}-\mathbf{b'} \|^2 \\
\mbox{s.t.} & x_k\ge 0, \forall k \\ & \mathbf{x}^T \mathbf{1}=1
\end{array}
I then transformed it into the first problem by moving the equality constraint into the objective function. Is it fine to do so?

Best Answer


Your problem can be conveniently rewritten as \begin{eqnarray} \underset{x \in \mathbb{R}^K}{\text{min }}f(x) + g(x), \end{eqnarray} where $f: x \mapsto \frac{1}{2}\|Ax-b\|^2$ and $g = i_{\mathbb{R}^K_+}$ is the indicator function (in the convex-analytic sense) of the nonnegative orthant of $\mathbb{R}^K$. The factor $\frac{1}{2}$ does not change the minimizers. $f$ is smooth with Lipschitz gradient ($\|A\|^2$, the squared spectral norm of $A$, is a valid Lipschitz constant), while $g$ has a simple proximal operator $prox_g(x) = (x)_+$ (the orthogonal projector onto that orthant). So proximal methods like FISTA are your friend.
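
To make this concrete, here is a minimal FISTA sketch for the nonnegative problem. The function name `nnls_fista`, the zero starting point and the fixed iteration count `n_iter` are illustrative choices of mine, and the code assumes $A$ is a nonzero dense NumPy array, so that the step $1/\|A\|^2$ is well defined.

```python
import numpy as np


def nnls_fista(A, b, n_iter=500):
    """Approximately solve min_x 0.5 * ||Ax - b||^2 subject to x >= 0."""
    # Lipschitz constant of the gradient: squared spectral norm of A
    # (assumed nonzero here).
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    y = x.copy()  # extrapolated point
    t = 1.0       # momentum parameter
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)               # gradient of f at y
        x_new = np.maximum(y - grad / L, 0.0)  # prox step: projection (.)_+
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x


# Small separable example: the unconstrained minimizer is [1, -3],
# so the nonnegative solution is [1, 0].
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -3.0])
x_hat = nnls_fista(A, b)
```

A fixed iteration count keeps the sketch short; in practice you would probably stop once successive iterates stop changing.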

In your "main problem", the orthant is simply replaced with the standard simplex. The projector onto this simplex, though not available in closed form, can be computed very cheaply using (for example) the simple algorithm presented in section 3 of the paper http://www.magicbroom.info/Papers/DuchiShSiCh08.pdf.

This projection can be implemented in 3 lines of Python:

import numpy as np


def proj_simplex(v, z=1.):
    """Project v onto the simplex {x >= 0, x_0 + x_1 + ... + x_n = z}.

    The method is Duchi et al.'s O(n log n) Algorithm 1.
    """
    # deterministic O(n log n)
    u = np.sort(v)[::-1]  # sort v in decreasing order
    # candidate thresholds: (sum of the j largest entries - z) / j
    aux = (np.cumsum(u) - z) / np.arange(1., len(v) + 1.)
    # shift by the threshold at the last index where u exceeds it, clip at 0
    return np.maximum(v - aux[np.nonzero(u > aux)[0][-1]], 0.)
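
To show how the projection plugs into the "main problem", here is a rough projected-gradient sketch. It repeats the projection so that the snippet runs on its own. The name `simplex_least_squares`, the barycenter starting point and the fixed iteration count are my own assumptions, not something taken from the paper.

```python
import numpy as np


def proj_simplex(v, z=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = z} (sort-based)."""
    u = np.sort(v)[::-1]  # entries in decreasing order
    aux = (np.cumsum(u) - z) / np.arange(1.0, len(v) + 1.0)
    theta = aux[np.nonzero(u > aux)[0][-1]]  # shift making the result sum to z
    return np.maximum(v - theta, 0.0)


def simplex_least_squares(A, b, n_iter=1000):
    """Projected gradient for min 0.5 * ||Ax - b||^2 over the standard simplex."""
    L = np.linalg.norm(A, 2) ** 2              # assumes A is nonzero
    x = np.full(A.shape[1], 1.0 / A.shape[1])  # start at the barycenter
    for _ in range(n_iter):
        x = proj_simplex(x - A.T @ (A @ x - b) / L)
    return x


# With A = I the problem reduces to projecting b onto the simplex.
p = proj_simplex(np.array([2.0, 0.0]))
x_hat = simplex_least_squares(np.eye(2), np.array([2.0, 0.0]))
```

Both `p` and `x_hat` should land on the vertex $(1, 0)$ here, since that is the point of the simplex closest to $(2, 0)$.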

BTW, what is the proximal operator of an "arbitrary" convex function $g$?

Formally, \begin{eqnarray} prox_g(x) := \underset{p \in \mathbb{R}^K}{\text{argmin }}\frac{1}{2}\|p-x\|^2 + g(p). \end{eqnarray}

"Proximable" functions (i.e. functions for which the argmin problem in the definition above is easy to solve, for any point $x$) play just as important a role as differentiable functions. The proximal operator lets you make "implicit gradient steps". Indeed, one has the characterization \begin{eqnarray}p = prox_g(x)\text{ iff } x - p \in \partial g(p), \end{eqnarray} where \begin{eqnarray}\partial g(p) := \{u \in \mathbb{R}^K | g(q) \ge g(p) + \langle u, q - p\rangle \ \forall q \in \mathbb{R}^K\}\end{eqnarray} is the subdifferential of $g$ at $p$ (this reduces to the singleton $\{\nabla g(p)\}$ if $g$ is differentiable at $p$).

In your problem(s) above, the proximal operator happens to be a projection operator. In fact, for any nonempty closed convex subset $C \subseteq \mathbb{R}^K$, a little algebra reveals that \begin{eqnarray} prox_{i_C}(x) = \underset{p \in \mathbb{R}^K}{\text{argmin }}\frac{1}{2}\|p-x\|^2 + i_C(p) = \underset{p \in C}{\text{argmin }}\frac{1}{2}\|p-x\|^2 =: proj_C(x), \end{eqnarray} where $i_C$ is the indicator function of $C$, defined by $i_C(x) := 0$ if $x \in C$ and $+\infty$ otherwise.

A less trivial example is the $\ell_1$-norm $\|.\|_1$, whose proximal operator (with parameter $\gamma > 0$) is the so-called soft-thresholding operator $prox_{\gamma\|.\|_1}(x) = soft_\gamma(x) = (v_k)_{1\le k \le K}$, where \begin{eqnarray} v_k := \left(1- \dfrac{\gamma}{|x_k|}\right)_+x_k, \end{eqnarray} with the convention $v_k = 0$ when $x_k = 0$.

Proximal operators are a handy tool in modern convex analysis. They find great use in problems arising in signal processing, game theory, machine learning, etc. Here is a nice place to start learning about proximal operators and similar objects: http://arxiv.org/pdf/0912.3522.pdf.
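
As a concrete instance, the soft-thresholding formula for the $\ell_1$-norm given earlier can be sketched in NumPy as follows. The function name `soft_threshold` is mine, and the sign/max form is an equivalent rewrite of $(1 - \gamma/|x_k|)_+ x_k$ that I chose to avoid dividing by zero when $x_k = 0$.

```python
import numpy as np


def soft_threshold(x, gamma):
    """Componentwise soft-thresholding, i.e. the prox of gamma * ||.||_1 at x."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)


x = np.array([3.0, -0.5, 0.0, -2.0])
gamma = 1.0
p = soft_threshold(x, gamma)

# Check the characterization x - p in gamma * subdifferential of ||.||_1 at p:
# equal to gamma * sign(p_k) where p_k != 0, within [-gamma, gamma] where p_k == 0.
u = x - p
nonzero = p != 0
ok = (np.allclose(u[nonzero], gamma * np.sign(p[nonzero]))
      and np.all(np.abs(u[~nonzero]) <= gamma))
```

Entries with $|x_k| \le \gamma$ are set to zero, and the others are shrunk toward zero by $\gamma$.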

Most importantly, "under mild conditions" one can show (see the previous reference) that a point $x^*$ minimizes $f + g$ iff

\begin{eqnarray} x^* = prox_{\gamma g}(x^* - \gamma \nabla f(x^*)), \forall \gamma > 0 \end{eqnarray}
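
For what it is worth, here is a short sketch of where this comes from. It assumes the usual first-order optimality condition $0 \in \nabla f(x^*) + \partial g(x^*)$ (this is where the "mild conditions" enter) and uses the characterization $p = prox_g(x)$ iff $x - p \in \partial g(p)$, applied to $\gamma g$ in place of $g$: \begin{eqnarray} x^* \text{ minimizes } f + g &\iff& 0 \in \nabla f(x^*) + \partial g(x^*) \\ &\iff& (x^* - \gamma \nabla f(x^*)) - x^* \in \gamma\,\partial g(x^*) = \partial(\gamma g)(x^*) \\ &\iff& x^* = prox_{\gamma g}(x^* - \gamma \nabla f(x^*)). \end{eqnarray}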

Thus the minimizers of $f + g$ coincide with the fixed points of the operators $prox_{\gamma g}\circ(Id - \gamma \nabla f)$, $\gamma > 0$. This suggests the following algorithm, known as the forward-backward algorithm (Moreau; Lions and Mercier; P. L. Combettes et al.): \begin{eqnarray} x^{(n+1)} = \underbrace{prox_{\gamma_n g}}_{\text{backward / prox step}}\underbrace{(x^{(n)} - \gamma_n \nabla f(x^{(n)}))}_{\text{forward / gradient step}}, \end{eqnarray}

for an appropriately chosen sequence of step-sizes $(\gamma_n)_{n \in \mathbb{N}}$.
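
A bare-bones sketch of this iteration with a constant step size might look like the following. The name `forward_backward`, its signature and the calling convention `prox_g(v, gamma)` are assumptions made for illustration. The example applies it to a small nonnegative least squares problem with $\gamma = 1/\|A\|^2$.

```python
import numpy as np


def forward_backward(grad_f, prox_g, x0, step, n_iter=500):
    """Iterate x <- prox_{step * g}(x - step * grad_f(x)) with a constant step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)  # forward step, then backward step
    return x


# Nonnegative least squares: the prox is the projection (x)_+ for any gamma.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, -1.0, 0.0])
step = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = forward_backward(
    grad_f=lambda x: A.T @ (A @ x - b),
    prox_g=lambda v, gamma: np.maximum(v, 0.0),
    x0=np.zeros(2),
    step=step,
)
```

For this data the constrained minimizer is $(1/2, 0)$: the unconstrained least squares solution is $(1, -1)$, and the second coordinate gets clamped at zero.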

If $g$ is constant, so that it suffices to minimize $f$ alone, then the above iterates become \begin{eqnarray} x^{(n+1)} = x^{(n)} - \gamma_n \nabla f(x^{(n)}), \end{eqnarray}

and we recognize our old friend, the gradient descent algorithm, taught in high school.

N.B.: $(x)_+$ denotes the componentwise maximum of $x$ and $0$.
