[Math] Stopping criteria for gradient method

Tags: convex optimization, numerical optimization, optimization

For numerically solving a smooth convex optimization problem $\min\{f(x): x\in S\}$, where $S$ is a closed convex set, we can apply different algorithms (gradient method, accelerated gradient, proximal gradient, …) depending on the structure of the problem. Solving means finding a solution $x^*$ such that $f(x^*)=\inf\{f(x): x\in S\}:=f^*$. To this end, we try to construct an iterative sequence $\{x^k\}$ that converges to some solution $x^*$, or such that the sequence of values $\{f(x^k)\}$ tends to $f^*$. Note that if $x^k\to x^*$, then the continuity of $f$ ensures that $f(x^k) \to f^*$.

My questions are:

  1. In which cases should we focus on the convergence of $\{f(x^k)\}$ rather than that of $\{x^k\}$? Is finding a point $x^K$ that is close enough to a solution $x^*$ better than finding a point $x^K$ such that $f(x^K)$ is close enough to $f^*$?

  2. What is the best stopping criterion for an algorithm? I know the following ways (a rough code sketch follows the list):

    • Determine in advance the number of iterations needed to achieve a desired error $\epsilon$, i.e., find $N$ such that $k\geq N$ implies $||x^k-x^*||<\epsilon$ or $|f(x^k)-f^*|<\epsilon$. I see that this way is very reliable.

    • Terminate when $||x^{k+1}-x^k||$ or $|f(x^{k+1})-f(x^k)|$ is small enough.

    • Terminate when $||\nabla f(x^k)||$ is small enough.
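For concreteness, here is a rough Python sketch of how I would implement these checks inside a generic gradient loop; the function `f`, its gradient `grad_f`, the stepsize and the tolerance are placeholders, not any specific method:

```python
import numpy as np

def gradient_method(f, grad_f, x0, step, eps, max_iter=10_000):
    """Plain gradient descent with the stopping checks listed above."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):          # criterion 1: fixed iteration budget N = max_iter
        g = grad_f(x)
        if np.linalg.norm(g) < eps:    # criterion 3: small gradient norm
            return x, k
        x_new = x - step * g
        # criterion 2: small change in iterates or in function values
        if np.linalg.norm(x_new - x) < eps or abs(f(x_new) - f(x)) < eps:
            return x_new, k + 1
        x = x_new
    return x, max_iter
```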

Could you explain how the second and the third criteria work? Why does $||\nabla f(x^k)||$ being small enough imply that $f(x^k)$ approximates the optimal value $f^*$? I know that this can be verified when $f$ is strongly convex. Is this stopping criterion still reliable in the case where $f$ is not strongly convex?

Best Answer

I will discuss the termination criteria for the simple gradient method $x_{k+1} = x_{k} - \frac{1}{L}\nabla f(x_k)$ for unconstrained minimisation problems. If there are constraints, then we would use the projected gradient method, but similar termination conditions hold (imposed on the norm of the difference $x_k-z_k$, where $z_k$ is the projected gradient step).
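For the constrained case, a minimal sketch of such a projected gradient loop might look as follows; here `proj_S` is a placeholder for the projection onto $S$ (a box, a ball, etc.), and the test on $\|x_k - z_k\|$ is the one mentioned above:

```python
import numpy as np

def projected_gradient(grad_f, proj_S, x0, L, eps, max_iter=10_000):
    """Projected gradient method z_k = proj_S(x_k - (1/L) grad f(x_k)),
    stopped when the residual ||x_k - z_k|| is small."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        z = proj_S(x - grad_f(x) / L)    # z_k: projected gradient step
        if np.linalg.norm(x - z) < eps:  # fixed-point residual as stopping test
            return z, k
        x = z
    return x, max_iter
```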

The third criterion, namely $\|\nabla f(x_k) \| < \epsilon$, is fine for strongly convex functions with $L$-Lipschitz gradient. Indeed, if $f$ is $\mu$-strongly convex, that is,

$$\begin{aligned} f(y) \geq f(x) + \nabla f(x)^\top (y-x) + \tfrac{\mu}{2} \|y-x\|^2 \end{aligned},\tag{1} $$

then, for $x^*$ such that $\nabla f(x^*)=0$ (the unique minimiser of $f$), we have

$$\begin{aligned} f(x) - f(x^*)\leq \tfrac{1}{2\mu}\|\nabla f(x) \|^2, \end{aligned}\tag{2} $$

so, if $\|\nabla f(x) \|^2 < 2\mu\epsilon$, then $f(x) - f(x^*) < \epsilon$, i.e., $x$ is $\epsilon$-suboptimal.
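As a quick numerical sanity check of inequality (2), not part of the argument, take a strongly convex quadratic with $\mu$ the smallest eigenvalue of its Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)               # symmetric positive definite Hessian
b = rng.standard_normal(5)

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
mu = np.linalg.eigvalsh(Q)[0]         # strong convexity constant
x_star = np.linalg.solve(Q, b)        # unique minimiser: grad f(x*) = 0

for _ in range(5):
    x = rng.standard_normal(5)
    # inequality (2): f(x) - f(x*) <= ||grad f(x)||^2 / (2 mu)
    assert f(x) - f(x_star) <= np.linalg.norm(grad(x))**2 / (2 * mu) + 1e-12
```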

But termination is a mysterious thing... In general (under the assumptions you made), it is not true that $\| \nabla f(x) \| < \kappa \epsilon$, for some $\kappa > 0$, implies $\|x-x^*\|<\epsilon$ (not even locally). There might nevertheless be specific cases where such a bound holds. Unless you make some additional assumptions on $f$, this will not be a reliable termination criterion.

However, strong convexity is often too strong a requirement in practice. Weaker conditions are discussed in the article: D. Drusvyatskiy and A.S. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods, 2016.

Let $f$ be convex with $L$-Lipschitz gradient and define $\mathcal{B}_\nu^f = \{x: f(x) - f^* < \nu\}$. Let us assume that $f$ has a unique minimiser $x^*$ (e.g., $f$ is strictly convex). Assume further that $f$ has the property

$$\begin{aligned} f(x) - f(x^*) \geq \tfrac{\alpha}{2} \|x-x^*\|^2, \end{aligned}\tag{3}\label{3}$$

for all $x\in\mathcal{B}_\nu^f$, for some $\nu>0$. Functions which satisfy this property are not necessarily strongly convex; an example is $f(x) = (\max\{|x|-1,0\})^2$. Of course, the above holds if $f$ is strongly convex, and also if $f$ is of the form $f(x) = h(Ax)$, where $h$ is a strongly convex function and $A$ is any matrix.

Then, condition \eqref{3} is shown to be equivalent to

$$\begin{aligned} \|x-x^*\| \leq \frac{2}{\alpha} \|\nabla f(x) \|, \end{aligned}\tag{4}\label{4}$$

for all $x\in\mathcal{B}_{\nu}^f$ and with $\alpha < 1/L$.

Clearly, in this case we may use the termination condition $\| \nabla f(x) \| < \epsilon\alpha/2$: by \eqref{4}, this implies $\|x-x^*\| \leq \tfrac{2}{\alpha}\|\nabla f(x)\| < \epsilon$.

In regard to the second criterion, you may use it again for strongly convex functions, or if \eqref{3} holds locally about $x^*$. The reason is that the following bound holds for the gradient method:

$$\begin{aligned} \tfrac{1}{2L}\|\nabla f(x_k) \|^2 \leq f(x_k) - f(x_{k+1}). \end{aligned}\tag{5}\label{5}$$

The right-hand side of \eqref{5} is further upper bounded by $L_f \|x_k - x_{k+1}\|$, where $L_f$ is a Lipschitz constant of $f$ (being convex and smooth, $f$ is Lipschitz continuous on bounded sets), so a condition on $\|x_{k+1}-x_{k}\|$ may potentially be used. But we can see that the basis for all of this is the bound on $\|\nabla f(x_k) \|$.
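Again as a numerical sanity check (a sketch on an assumed quadratic test function), one can verify the decrease in \eqref{5} along the $1/L$ gradient method, together with the identity $\|x_{k+1}-x_k\| = \tfrac{1}{L}\|\nabla f(x_k)\|$ that links it to a test on successive iterates:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)
b = rng.standard_normal(5)
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q)[-1]         # Lipschitz constant of the gradient

x = rng.standard_normal(5)
for k in range(20):
    g = grad(x)
    x_new = x - g / L
    # sufficient decrease (5): ||grad f(x_k)||^2 / (2L) <= f(x_k) - f(x_{k+1})
    assert np.linalg.norm(g)**2 / (2 * L) <= f(x) - f(x_new) + 1e-12
    # for the exact 1/L step, ||x_{k+1} - x_k|| = ||grad f(x_k)|| / L
    assert np.isclose(np.linalg.norm(x_new - x), np.linalg.norm(g) / L)
    x = x_new
```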