Does gradient descent converge in convex optimization problems? If so, how?

convergence-divergence, convex optimization, convex-analysis, machine learning, optimization

Sorry in advance if this question sounds too broad or a little bit too obvious.

I know for sure that gradient descent, i.e., the update equation

$x_{k+1} = x_k - \epsilon_k \nabla f(x_k)$

converges to the unique minimizer of $f$ with domain $\text{dom}(f) = \mathbb{R}^n$ whenever $f$ is strictly or strongly convex.
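As a concrete instance of this update (my own toy example: a strongly convex quadratic with made-up data, not anything from a particular reference), the following sketch runs the iteration with a constant step size $\epsilon_k = 1/L$ and recovers the unique minimizer:

```python
import numpy as np

# Toy strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x,
# whose unique minimizer solves A x = b. A and b are made up for illustration.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_f(x):
    return A @ x - b                     # gradient of the quadratic

L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient
step = 1.0 / L                           # constant step size eps_k = 1/L

x = np.zeros(2)
for k in range(200):
    x = x - step * grad_f(x)             # x_{k+1} = x_k - eps_k * grad f(x_k)

print(x, np.linalg.solve(A, b))          # iterate vs. exact minimizer
```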

However, I cannot remember whether it converges to a minimizer for functions that are merely convex, or how it achieves this convergence.

What is bothering me is that

  1. I've seen some conflicting results where, instead of $x_k$, it is the averaged sequence $\hat x_K = \frac{1}{K} \sum_{k=1}^{K} x_k$ that converges.

  2. I've also seen conflicting results where the step size is decreasing (e.g., $o(1/k)$) versus constant.

  3. There is also the issue of weak vs strong convergence. I'm not sure what this means exactly.

I know some results, but they are for quadratic functions, not for convex functions in general.

Can someone chime in on what these basic results in optimization look like?

Best Answer

In https://stanford.edu/~boyd/papers/pdf/monotone_primer.pdf, Section 5, the subsection "Gradient method" (or "Forward step method") gives a proof for functions that are strongly convex with Lipschitz gradient, using constant step sizes.
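If it helps to see the shape of the guarantee, the standard statement for this setting (written from memory, not quoted from that paper) is linear convergence: if $f$ is $m$-strongly convex and $\nabla f$ is $L$-Lipschitz, then gradient descent with the constant step $\epsilon = 1/L$ satisfies

$$\|x_k - x^\star\|^2 \le \left(1 - \frac{m}{L}\right)^k \|x_0 - x^\star\|^2,$$

so the iterates themselves converge geometrically to the unique minimizer $x^\star$.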

If the function is convex with Lipschitz gradient, then https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf gives a proof of convergence with constant step sizes.
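If I recall those notes correctly, the guarantee in the merely convex, $L$-smooth case is on the objective value rather than on the iterates: assuming a minimizer $x^\star$ exists, gradient descent with a constant step $\epsilon \le 1/L$ satisfies

$$f(x_k) - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2 \epsilon k},$$

i.e., a sublinear $O(1/k)$ rate, as opposed to the geometric rate in the strongly convex case.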

For strictly convex functions whose gradient is not Lipschitz, gradient descent with a constant step size will not converge in general. Try this very simple example: let $f(x) = x^{4}$. You will see that there is no constant step size for which gradient descent converges to $0$ from every initial condition; for any fixed step size, sufficiently large starting points make the iterates diverge. In this case, people use diminishing step sizes. I've never seen results about convergence of the average, but it might be what happens when you use a constant step size on functions that are strictly convex and don't have a Lipschitz gradient.
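Here is a quick numerical check of that example (the step size and starting points are arbitrary choices of mine, just for illustration). Because the gradient $4x^3$ is not globally Lipschitz, a fixed step size only works for starting points that are small enough: the update is $x_{k+1} = x_k(1 - 4\epsilon x_k^2)$, which shrinks $|x_k|$ only while $4\epsilon x_k^2 < 2$.

```python
def grad(x):
    return 4.0 * x ** 3          # gradient of f(x) = x^4

eps = 0.1                        # constant step size

x = 1.0                          # small start: 4*eps*x^2 = 0.4 < 2, contracts
for k in range(1000):
    x = x - eps * grad(x)
print("from x0 = 1:", x)         # slowly approaches the minimizer 0

x = 3.0                          # large start: 4*eps*x^2 = 3.6 > 2, diverges
for k in range(6):
    x = x - eps * grad(x)
print("from x0 = 3:", x)         # already astronomically large
```

The same experiment with any other fixed step size shows the same split: there is always a threshold on $|x_0|$ beyond which the iterates blow up.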

Unless $x$ lives in an infinite-dimensional space, weak and strong convergence are the same, and I wouldn't worry about it. Here is additional reading if you are interested: https://people.maths.bris.ac.uk/~mazag/fa17/lecture9.pdf
