Convex Optimization – Why Projected Gradient Descent Method Works

convex-optimization, convex-analysis, gradient-descent, numerical-optimization, projection

Consider the problem
\begin{align*}
\min_{x \in \mathbb{R}^n} &\quad f(x) \\
\text{s.t.:} &\quad x \in C,
\end{align*}
where $C$ is a closed convex set. As $C$ is closed and convex, the projection onto $C$, $P_C$, is well defined for every $x \in \mathbb{R}^n$. The projected gradient method solves the above optimization problem by taking steps of the form $x_{t+1} = P_C[x_t - \eta \nabla f(x_t)]$.
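
For concreteness, here is a minimal sketch of this iteration in Python. The choice of $C$ as the Euclidean unit ball, the quadratic objective, and the step size are illustrative assumptions only, not part of the question:

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto C = {x : ||x|| <= 1} (an illustrative convex set)."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def projected_gradient_descent(grad_f, x0, eta=0.1, num_steps=100):
    """Iterate x_{t+1} = P_C[x_t - eta * grad_f(x_t)]."""
    x = x0
    for _ in range(num_steps):
        x = project_unit_ball(x - eta * grad_f(x))
    return x

# Example: minimize f(x) = ||x - a||^2 over the unit ball, with a outside the ball.
a = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - a)
x_star = projected_gradient_descent(grad_f, x0=np.zeros(2))
print(x_star)  # approaches a / ||a||, the closest point of the ball to a
```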

It is well known that for unconstrained problems the negative gradient is the direction of steepest descent, but why does the method still work after projecting? Namely, is there a result along the following lines?

"Exists $\varepsilon > 0$ so that for $0 < \delta < \varepsilon$, we have $f(P_C[x – \delta\nabla f(x)]) \le f(x)$"

As a bonus question: when there is more than one convex constraint (say $x \in C$ and $x \in D$, with both sets convex), there is an alternating projection method that consists of projecting repeatedly onto $C$ and then onto $D$. The sequence of projections converges to a point in $C\cap D$, provided $C \cap D \ne \emptyset$. But that limit is not necessarily the projection of the initial point onto $C \cap D$, so why can alternating projections work together with gradient descent?
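
Here is a rough sketch of the scheme in question: a gradient step followed by alternating projections onto the two sets. The halfspaces, objective, and step size below are only illustrative choices, not part of the question:

```python
import numpy as np

def project_halfspace(x, a, b):
    """Projection onto the halfspace {x : <a, x> <= b} (a simple convex set for illustration)."""
    slack = a @ x - b
    return x if slack <= 0 else x - slack * a / (a @ a)

def alternating_projections(x, projections, rounds=5):
    """Alternating (POCS-style) projections onto several convex sets.
    The limit is some point of the intersection (if nonempty), but in general
    NOT the exact projection of x onto the intersection."""
    for _ in range(rounds):
        for P in projections:
            x = P(x)
    return x

# Illustration: minimize f(x) = ||x - a_pt||^2 over C = {x_1 <= 1} and D = {x_2 <= 0.5}.
a_pt = np.array([3.0, 2.0])
grad_f = lambda x: 2.0 * (x - a_pt)
P_C = lambda x: project_halfspace(x, np.array([1.0, 0.0]), 1.0)
P_D = lambda x: project_halfspace(x, np.array([0.0, 1.0]), 0.5)

x, eta = np.zeros(2), 0.1
for _ in range(200):
    x = alternating_projections(x - eta * grad_f(x), [P_C, P_D])
print(x)  # approaches (1.0, 0.5), the minimizer of f over the intersection
```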

Best Answer

I think a rigorous proof of convergence is not so easy, but it can certainly be found in several books on the topic. I will give you a heuristic argument to show the plausibility of the algorithm, focusing on the first part of your question for now.

Heuristic arguments

  • While moving in the direction of the negative gradient $-\nabla f(x)$ leads to descent, this is not the only direction with this property. In fact, for every direction $d\in\Bbb R^n$ with $\def\<{\langle}\def\>{\rangle}\<d,\nabla f(x)\> < 0$ there is some $\epsilon > 0$ so that $$f(x+\eta d)< f(x),\qquad\text{for all $0<\eta<\epsilon$}.$$ However, if $\<d,\nabla f(x)\>\ge 0$, then this direction leads to ascent (or at least to no descent).

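A quick numerical illustration of this claim, with an arbitrary smooth convex $f$ and a hand-picked direction $d$, both chosen only for demonstration:

```python
import numpy as np

# f(x) = x1^2 + 3*x2^2, an arbitrary smooth convex function for illustration
f = lambda x: x[0]**2 + 3 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
d = np.array([-1.0, 0.2])           # not the negative gradient ...
assert d @ grad_f(x) < 0            # ... but still a descent direction

for eta in [1e-1, 1e-2, 1e-3]:
    print(eta, f(x + eta * d) < f(x))   # True for all sufficiently small eta
```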

  • I assume by projection you mean $$P_C[x] := \mathop{\mathrm{argmin}}\limits_{c\in C} \|x-c\|.$$ There is an insightful way to think about this projection: every $x\in\Bbb R^n$ can be written as $x=c+v$ in a unique way, where $c\in C$ and $v\in N_C(c)$ is from the normal cone of $C$ at $c$ (the polar cone of the tangent cone at $c$). We then have $P_C[x]=c$. Or, to put it another way: $$P_C[x]=c\quad\Longleftrightarrow \quad x-c\in N_C(c).$$

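This characterization, $x-P_C[x]\in N_C(P_C[x])$, i.e. $\langle x-P_C[x],\,c-P_C[x]\rangle\le 0$ for all $c\in C$, can be checked numerically. Here is a sketch for the unit ball, again only an illustrative choice of $C$:

```python
import numpy as np

def project_unit_ball(x):
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

rng = np.random.default_rng(0)
x = rng.normal(size=3) * 3          # a point, typically outside the ball
p = project_unit_ball(x)            # its projection onto C
v = x - p                           # should lie in the normal cone N_C(p)

# Definition of the normal cone: <v, c - p> <= 0 for every c in C.
cs = rng.normal(size=(1000, 3))
cs /= np.maximum(np.linalg.norm(cs, axis=1, keepdims=True), 1.0)  # sample points of C
print(np.all((cs - p) @ v <= 1e-12))   # True
```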

  • We now show that if $x_{t+1}=P_C[x_t-\eta \nabla f(x_t)]$ is different from the current point $x_t$, then it lies in a direction of descent. In other words, $x_{t+1}-x_t$ is a descent direction: $$\<x_{t+1}-x_t,\nabla f(x_t)\>< 0.$$ This is very evident from a picture, but a bit messy to prove. Also, this alone does not show that $f(x_{t+1})<f(x_t)$, but it makes it plausible that this will happen if $\eta$ is sufficiently small. So assume the opposite, namely that $x_{t+1}$ lies in a direction of ascent: $$\<x_{t+1}-x_t,\nabla f(x_t)\>\ge 0\qquad\implies\qquad (*)\;\; \<x_t-x_{t+1},\eta\nabla f(x_t)\>\le 0,$$ where multiplying by $\eta>0$ does not hurt the inequality. Because the unconstrained gradient step $\color{red}{x_t - \eta\nabla f(x_t)}$ projects to $x_{t+1}$, it can be written as $\color{blue}{x_{t+1}+v}$ for some $v\in N_C(x_{t+1})$. Since $v$ is in the normal cone, we have $$\<c-x_{t+1},v\>\le 0\;\text{ for all $c\in C$},\qquad\text{in particular for $c=x_t$:}\qquad (**)\;\; \<x_t-x_{t+1},v\>\le 0.$$ Adding $(*)$ and $(**)$ yields $$(\times)\;\; \<x_t-x_{t+1},v+\eta\nabla f(x_t)\> \le 0.$$ But the two factors of this inner product are actually equal, as is evident from $$\color{red}{x_t-\eta\nabla f(x_t)} = \color{blue}{x_{t+1}+v}\qquad\implies\qquad x_t-x_{t+1} = v+\eta\nabla f(x_t).$$ So $(\times)$ gives us $\|x_t-x_{t+1}\|^2\le 0\implies x_t=x_{t+1}$, a contradiction. Thus, the only way not to move in a descent direction is not to move at all.

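A small numerical check of this inequality along the projected gradient iteration, reusing the toy ball-constrained quadratic from above (illustrative only):

```python
import numpy as np

# Toy setup: f(x) = ||x - a||^2 over the unit ball (illustrative choices only).
a = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - a)
project = lambda x: x if np.linalg.norm(x) <= 1 else x / np.linalg.norm(x)

x, eta = np.array([-0.5, 0.8]), 0.2
for t in range(20):
    x_next = project(x - eta * grad_f(x))
    step = x_next - x
    if np.linalg.norm(step) > 1e-12:
        print(step @ grad_f(x) < 0)   # True: the projected step is a descent direction
    x = x_next
```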

  • It remains to show that not moving can only happen if there is no descent direction available, i.e. that we are at a local minimum. If $x_t-\eta\nabla f(x_t)$ projects to $x_t$ again, then this means \begin{align} x_t-\eta\nabla f(x_t)-x_t\in N_C(x_t) \quad &\implies\quad -\nabla f(x_t)\in N_C(x_t) \quad (\text{$N_C(x_t)$ is a cone and $\eta>0$}) \\ &\implies\quad \<c-x_t,\nabla f(x_t)\> \ge 0,\; \text{for all $c\in C$}. \end{align} The last line means that moving in the direction of any point $c\in C$ (and in a convex set these are all the available directions) does not lead to first-order descent, hence we are indeed at a local minimum; for convex $f$ this is in fact a global minimum over $C$.
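
As a sanity check of this optimality condition, here is a sketch on the same toy problem, where the minimizer over the unit ball is known in closed form as $a/\|a\|$ (again an illustrative setup, not part of the argument):

```python
import numpy as np

# Check the optimality condition <c - x*, grad f(x*)> >= 0 for the toy problem
# f(x) = ||x - a||^2 over the unit ball, whose minimizer is x* = a / ||a||.
a = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - a)
x_star = a / np.linalg.norm(a)

rng = np.random.default_rng(1)
cs = rng.normal(size=(1000, 2))
cs /= np.maximum(np.linalg.norm(cs, axis=1, keepdims=True), 1.0)  # sample points of C
print(np.all((cs - x_star) @ grad_f(x_star) >= -1e-12))   # True
```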