Convex Optimization, Non-negativity constraints, Interior-Point or Projected Gradient

convex optimization · gradient descent · nonlinear optimization

Assume I have the following convex optimization problem, with a convex objective function subject to conventional non-negativity constraints:

\begin{align}
\min_{\mathbf{x} \geq 0} \sum_{i=1}^{I} a_{i}x_{i} - f(\mathbf{x}),
\end{align}

where $\mathbf{x}$ is a vector of dimension $I$, $f(\mathbf{x})$ is a concave function, and the dimension $I$ is very large. The solution is sparse, meaning that the optimal vector contains a large number of zeros. This comes from the fact that for some entries $i$,

\begin{align}
a_{i} > f_{i}(\mathbf{x}^{*}),
\end{align}

where $f_{i}(\mathbf{x}^{*})$ denotes $\partial f(\mathbf{x}^{*})/\partial x_{i}$. Since the problem is convex, I know that the KKT conditions are necessary and sufficient for global optimality, all local optima are also global, and the set of solutions is convex.
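To make the sparsity claim concrete, one way to spell it out is via the KKT conditions of the problem, written with multipliers $\mu_{i} \geq 0$ for the constraints $x_{i} \geq 0$:

\begin{align}
a_{i} - f_{i}(\mathbf{x}^{*}) - \mu_{i} = 0, \qquad \mu_{i} \geq 0, \qquad \mu_{i}x_{i}^{*} = 0, \qquad x_{i}^{*} \geq 0,
\end{align}

i.e. $a_{i} - f_{i}(\mathbf{x}^{*}) \geq 0$, $x_{i}^{*} \geq 0$ and $x_{i}^{*}\bigl(a_{i} - f_{i}(\mathbf{x}^{*})\bigr) = 0$ for every $i$. In particular, whenever $a_{i} > f_{i}(\mathbf{x}^{*})$ holds strictly, complementary slackness forces $x_{i}^{*} = 0$.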

However, I have questions about how to solve for the optimum numerically:

  1. If I use a projected gradient method, I have seen online that the update (a gradient step followed by a projection onto the feasible set) is just

$$ x^{k+1} = \max\left\{0,x^{k}-\alpha_{k}(a-\nabla f(x^{k}))\right\} $$

However, I have not found a proof or a paper that shows this. Can someone provide a proof or reference?

  2. The other option is to use interior-point methods. However, what confuses me in this case is how barrier methods work when the solution is not interior, but lies on the boundary for some entries.

Best Answer

Disclaimer. This would have been a very long comment; I am posting it here in the hope that the OP finds it helpful. Please google the references I mention rather colloquially below.


Let $F:\mathbb R^n \to \mathbb R$ be a differentiable convex function and let $G:\mathbb R^n \to \mathbb R \cup \{\infty\}$ be an extended-value proper convex function. Consider the minimization of the sum $H:=F+G$. By first-order optimality conditions, a point $x_\star \in \mathbb R^n$ is a minimizer of $H$ iff $0 \in \partial H(x_\star)$, the subdifferential of $H$ at $x_\star$. Now, under mild conditions, the subdifferential of $H$ is the sum of subdifferentials of $F$ and $G$ pointwise. We deduce that $$ 0 \in \nabla F(x_\star) + \partial G(x_\star),\text{ i.e }-\nabla F(x_\star) \in \partial G(x_\star). \tag{1} $$

Things like (1) are referred to as variational inequalities / inclusions (a terminology likely due to R.T. Rockafellar).

Continuing our intuitionist démarche, observe that, for positive $\alpha$, (1) can be written as $x_\star - \alpha \nabla F(x_\star) \in x_\star + \alpha \partial G(x_\star) = (I + \alpha\partial G)(x_\star)$. Now, because $\partial G$ is a maximal monotone operator (due to a result of Rockafellar from the 70s), we can invert the last variational inclusion to get

$$ x_\star = (I+\alpha \partial G)^{-1}(x_\star-\alpha \nabla F(x_\star)) = \mbox{prox}_{\alpha G}(x_\star-\alpha \nabla F(x_\star)), $$

where $\mbox{prox}_{\alpha G}:\mathbb R^n \to \mathbb R^n$ is the proximal operator of $G$ at level $\alpha$, defined by

$$ \mbox{prox}_{\alpha G}(x) := (I + \alpha\partial G)^{-1} (x) = \arg\min_{z \in \mathbb R^n} \left\{ \tfrac{1}{2} \|z-x\|_2^2 + \alpha G(z) \right\}. $$ Thus,

The minimizers $x_\star$ of $H$ and the fixed points of the operator $x \mapsto \mbox{prox}_{\alpha G}(x-\alpha\nabla F(x))$ coincide. The proximal / projected gradient algorithms can then be seen as nothing but algorithmic ways of computing such fixed points!

Now, Lions and co-workers have shown that if $F$ has, say, a Lipschitz-continuous gradient, then for appropriate stepsizes $(\alpha_k)_k$, (1) can be solved by iterating the following scheme:

  • $x^{k+1} \leftarrow \mbox{prox}_{\alpha_k G}(x^k - \alpha_k \nabla F(x^k))$
  • $k \leftarrow k + 1$.
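For concreteness, here is a minimal Python sketch of this forward-backward iteration. The names `grad_F` and `prox_G`, the constant stepsize and the fixed iteration count are choices of mine for illustration, not something prescribed by the references above.

```python
import numpy as np

def proximal_gradient(grad_F, prox_G, x0, alpha, n_iter=1000):
    """Minimal forward-backward sketch: x <- prox_{alpha G}(x - alpha * grad F(x)).

    grad_F : callable, gradient of the smooth part F
    prox_G : callable (y, alpha) -> prox_{alpha G}(y)
    alpha  : constant stepsize (e.g. of order 1/L if grad F is L-Lipschitz)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # forward (gradient) step on F, then backward (proximal) step on G
        x = prox_G(x - alpha * grad_F(x), alpha)
    return x
```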

Finally, in case $G$ is the indicator function of a closed convex subset $C$ of $\mathbb R^n$ (meaning that $G$ evaluates to $0$ on $C$, and to $\infty$ outside of $C$), then $\mbox{prox}_{\alpha G} = \mbox{proj}_C$ for every $\alpha > 0$. In particular, for $C = \mathbb R^n_+$ the projection acts componentwise as $\mbox{proj}_C(x)_i = \max\{0, x_i\}$, so with $F(x) = a^Tx - f(x)$ (hence $\nabla F(x) = a - \nabla f(x)$) the scheme above is exactly the update in the OP's first question.
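As a usage example of the sketch above on a toy instance of the OP's problem (the concave choice $f(\mathbf{x}) = \sum_i \log(1+x_i)$ is an assumption of mine, not taken from the question), `prox_G` is just componentwise clipping at zero and the computed solution is indeed sparse:

```python
import numpy as np

# Toy instance of min_{x >= 0} a.x - f(x) with the assumed f(x) = sum_i log(1 + x_i),
# so that F(x) = a.x - f(x) has gradient a - 1/(1 + x).
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=1000)           # entries with a_i > 1 = df/dx_i(0) should end up at zero

grad_F = lambda x: a - 1.0 / (1.0 + x)
prox_G = lambda y, alpha: np.maximum(y, 0.0)   # projection onto the nonnegative orthant

x_star = proximal_gradient(grad_F, prox_G, x0=np.ones_like(a), alpha=0.5)

# For this separable toy objective the optimum is known in closed form: x_i* = max(0, 1/a_i - 1).
print(np.allclose(x_star, np.maximum(0.0, 1.0 / a - 1.0), atol=1e-6))
print("fraction of zeros:", np.mean(x_star == 0.0))
```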