Prove that gradient descent is a contraction for strongly convex smooth functions

convex optimization, convex-analysis, nonlinear optimization, optimization

I am trying to prove that gradient descent on an $\alpha$-strongly convex, $\beta$-smooth function $f$ is a contractive operator. I am looking for a general proof that does not assume existence of the second derivative and that also obtains the optimal bound, which I believe is $\max\{\left|1 - \lambda\alpha\right|, \left|1 - \lambda\beta\right|\}$, where $\lambda$ is the step size (see here, page 15; page 13 using the pdf page counter). This is my effort so far:

Gradient descent can be written in the form of the operator $T = (I - \lambda\nabla f)$.
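
Equivalently, a single iteration of gradient descent reads $x_{k+1} = Tx_{k} = x_{k} - \lambda\nabla f(x_{k})$, where $x_{k}$ denotes the $k$-th iterate.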

$$
\begin{aligned}
\|Tx - Ty\|^{2} &= \|x - \lambda\nabla f(x) - y + \lambda\nabla f(y)\|^{2} \\
&= \|(x - y) - \lambda(\nabla f(x) - \nabla f(y))\|^{2} \\
&= \|x - y\|^{2} + \lambda^{2}\|\nabla f(x) - \nabla f(y)\|^{2} - 2\lambda\left<x - y, \nabla f(x) - \nabla f(y)\right>
\end{aligned}
$$

Using the co-coercivity property of the $(\beta - \alpha)$-smooth (and convex) function $f(x) - \frac{\alpha}{2}\|x\|^{2}$, we get the bound:

$$
\left<x - y, \nabla f(x) - \nabla f(y)\right> \geq \frac{\alpha\beta}{\alpha + \beta}\|x - y\|^{2} + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^{2}
$$
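
For completeness, here is how this follows. Write $g(x) = f(x) - \frac{\alpha}{2}\|x\|^{2}$, so that $\nabla g(x) = \nabla f(x) - \alpha x$; strong convexity of $f$ makes $g$ convex, and $\beta$-smoothness of $f$ makes $g$ $(\beta - \alpha)$-smooth (assuming $\beta > \alpha$). Co-coercivity of $g$ states that

$$
\left<x - y, \nabla g(x) - \nabla g(y)\right> \geq \frac{1}{\beta - \alpha}\|\nabla g(x) - \nabla g(y)\|^{2}
$$

Substituting $\nabla g(x) - \nabla g(y) = \nabla f(x) - \nabla f(y) - \alpha(x - y)$, expanding the squared norm on the right-hand side, and collecting terms gives exactly the displayed bound.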

Combining this inequality with the above we have:

$$
\|Tx - Ty\|^{2} \leq \|x - y\|^{2} + \lambda^{2}\|\nabla f(x) - \nabla f(y)\|^{2} - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\|x - y\|^{2} + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^{2}\right)
$$

Simplifying we have:
$$
\|Tx - Ty\|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2} - \frac{2\lambda}{\alpha + \beta}\right)\|\nabla f(x) - \nabla f(y)\|^{2}
$$

Note that setting $\lambda = \frac{2}{\alpha + \beta}$ recovers the convergence rate for gradient descent here. How do I proceed from this point? One idea I have is to try to ignore the last term by ensuring

$$
\lambda^{2} - \frac{2\lambda}{\alpha + \beta} \leq 0 \implies \lambda \leq \frac{2}{\alpha + \beta}
$$

This gives the inequality:

$$
\|Tx - Ty\|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2}
$$

EDIT: I guess we also need to ensure that

$$
\left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right) \geq 0 \implies \lambda \leq \frac{\alpha + \beta}{2\alpha\beta}
$$

But then I am unsure how to proceed from here. Perhaps there is a way I can use $\beta$-smoothness instead to bound the $\|\nabla f(x) - \nabla f(y)\|$ term?

Best Answer

Okay, the proof is actually simple from this point. Restrict the step size to $\lambda \in (0, \frac{2}{\beta})$. This range is chosen because at $\lambda = \frac{2}{\beta}$ we have $\max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\} = |1 - \lambda\beta| = 1$, so the bound below only gives nonexpansiveness rather than a contraction (this is easy to verify).

Recall the bound we had earlier:

$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2} - \frac{2\lambda}{\alpha + \beta}\right)\|\nabla f(x) - \nabla f(y)\|^{2} $$

When $\lambda \leq \frac{2}{\alpha + \beta}$, the coefficient of $\|\nabla f(x) - \nabla f(y)\|^{2}$ on the right-hand side is nonpositive, so we can replace $\|\nabla f(x) - \nabla f(y)\|$ by a lower bound and the inequality still holds. We will use the lower bound that follows from $\alpha$-strong convexity:

$$ \|\nabla f(x) - \nabla f(y)\| \geq \alpha \|x - y\| $$
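
(To see this: $\alpha$-strong convexity gives $\left<x - y, \nabla f(x) - \nabla f(y)\right> \geq \alpha\|x - y\|^{2}$, while Cauchy–Schwarz gives $\left<x - y, \nabla f(x) - \nabla f(y)\right> \leq \|x - y\|\,\|\nabla f(x) - \nabla f(y)\|$; dividing by $\|x - y\|$ when $x \neq y$ yields the claim, and the case $x = y$ is trivial.)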

Plugging this in then gives:

$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\alpha^{2} - \frac{2\alpha^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2} $$
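
The coefficients of $\|x - y\|^{2}$ combine neatly, since

$$ \frac{2\lambda\alpha\beta}{\alpha + \beta} + \frac{2\alpha^{2}\lambda}{\alpha + \beta} = \frac{2\lambda\alpha(\beta + \alpha)}{\alpha + \beta} = 2\lambda\alpha $$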

Rearranging and simplifying then yields:

$$ \|Tx - Ty\|^{2} \leq (\lambda^{2}\alpha^{2} - 2\lambda\alpha + 1)\|x-y\|^{2} $$

Now factorising, and noting that $1 - \lambda\alpha \geq 0$ here (since $\lambda \leq \frac{2}{\alpha + \beta} \leq \frac{1}{\alpha}$), we have

$$ \|Tx - Ty\|^{2} \leq (1 - \lambda\alpha)^{2}\|x-y\|^{2} \iff \|Tx - Ty\| \leq (1 - \lambda\alpha)\|x - y\| $$

Note that when $\lambda \leq \frac{2}{\alpha + \beta}$, we have $1 - \lambda\alpha = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$:

  • $1 - \lambda\alpha \geq 1- \lambda\beta$ follows from $\beta > \alpha$
  • $1 - \lambda\alpha \geq \lambda\beta - 1 \iff 2 \geq \lambda(\alpha + \beta) \iff \frac{2}{\alpha + \beta} \geq \lambda$ which is true by assumption.
  • $1 - \lambda\alpha \geq \alpha\lambda - 1$ follows from the previous point, since $\alpha\lambda - 1 \leq \lambda\beta - 1 \leq 1 - \lambda\alpha$ when $\beta > \alpha$.

Thus we have proved that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $0 < \lambda \leq \frac{2}{\alpha+\beta}$. For $\lambda \geq \frac{2}{\alpha + \beta}$, the coefficient of $\|\nabla f(x) - \nabla f(y)\|^{2}$ on the RHS of our inequality is nonnegative, and thus we can instead replace $\|\nabla f(x) - \nabla f(y)\|$ with an upper bound using $\beta$-smoothness (Lipschitz continuity of the gradient):

$$ \|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\| $$

Plugging this in yields:

$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\beta^{2} - \frac{2\beta^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2} $$

Simplifying and factorising as before (the cross terms now combine to $-2\lambda\beta$, and $\lambda\beta - 1 \geq 0$ since $\lambda \geq \frac{2}{\alpha + \beta} \geq \frac{1}{\beta}$), we find that:

$$ \|Tx - Ty\|^{2} \leq (\lambda\beta - 1)^{2}\|x-y\|^{2} \iff \|Tx - Ty\| \leq (\lambda\beta - 1)\|x - y\| $$

Similarly to before, when $\lambda \geq \frac{2}{\alpha + \beta}$, we see that $\lambda\beta - 1 = \max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$. Thus we have shown that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda$ such that $\frac{2}{\beta} > \lambda \geq \frac{2}{\alpha + \beta}$.

Combining both cases ($\lambda \geq \frac{2}{\alpha+\beta}$, $\lambda \leq \frac{2}{\alpha +\beta}$) we see that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda \in \left(0, \frac{2}{\beta}\right)$, i.e.

$$ \|Tx - Ty\| \leq \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}\|x - y\| $$

It is now also immediately obvious why $\frac{2}{\beta}$ is an upper bound on the step size. For $\lambda \geq \frac{2}{\alpha + \beta}$ we have

$$ \lambda\beta - 1 = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\} $$

But

$$ \lambda\beta - 1 < 1 \iff \lambda < \frac{2}{\beta} $$

So requiring $\lambda < \frac{2}{\beta}$ ensures $\max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ is less than $1$.
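
Not part of the proof, but here is a quick numerical sanity check of the bound on a strongly convex quadratic. Everything in it (the matrix $A$, the constants $\alpha, \beta$, the step sizes) is an arbitrary illustrative choice; for $f(x) = \frac{1}{2}x^{\top}Ax$ with the spectrum of $A$ contained in $[\alpha, \beta]$, the observed contraction ratio should stay below $\max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 0.5, 4.0                       # strong convexity / smoothness constants
# Build a symmetric A with eigenvalues in [alpha, beta] (endpoints included),
# so f(x) = 0.5 * x^T A x is alpha-strongly convex and beta-smooth.
eigs = np.concatenate(([alpha, beta], rng.uniform(alpha, beta, size=8)))
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
A = Q @ np.diag(eigs) @ Q.T

def T(x, lam):
    # One gradient-descent step; grad f(x) = A x for this quadratic.
    return x - lam * (A @ x)

for lam in [0.1, 2 / (alpha + beta), 0.45]:  # step sizes in (0, 2/beta) = (0, 0.5)
    bound = max(abs(1 - lam * alpha), abs(1 - lam * beta))
    worst = 0.0
    for _ in range(1000):
        x, y = rng.standard_normal(10), rng.standard_normal(10)
        worst = max(worst, np.linalg.norm(T(x, lam) - T(y, lam)) / np.linalg.norm(x - y))
    print(f"lam={lam:.3f}  observed ratio={worst:.4f}  bound={bound:.4f}")
```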