I am trying to prove that gradient descent on an $\alpha$-strongly convex, $\beta$-smooth function $f$ is a contractive operator. I am looking for a general proof that does not assume existence of the second derivative and that also obtains the optimal bound, which I believe is $\max\{\left|1 - \lambda\alpha\right|, \left|1 - \lambda\beta\right|\}$, where $\lambda$ is the step size (see here page 15 (page 13 using the pdf page counter)). This is my effort so far:
Gradient descent can be written in the form of the operator $T = (I - \lambda\nabla f)$.
$$
\|Tx - Ty \|^{2} = \|x - \lambda\nabla f(x) - y + \lambda \nabla f(y) \|^{2} \\
= \|(x - y) - \lambda(\nabla f(x) - \nabla f(y))\|^{2} \\
= \|x - y\|^{2} + \lambda^{2}\|\nabla f(x) - \nabla f(y)\|^{2} - 2\lambda\langle x - y, \nabla f(x) - \nabla f(y)\rangle
$$
Using the co-coercivity property of the $(\beta - \alpha)$-smooth convex function $f(x) - \frac{\alpha}{2}\|x\|^{2}$, we get the bound:
$$
\langle x - y, \nabla f(x) - \nabla f(y)\rangle \geq \frac{\alpha\beta}{\alpha + \beta}\|x - y\|^{2} + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^{2}
$$
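As a quick numerical sanity check (not part of the proof), this inequality can be tested on an assumed quadratic instance $f(x) = \frac{1}{2}x^{\top}Ax$, which is $\alpha$-strongly convex and $\beta$-smooth whenever the eigenvalues of $A$ lie in $[\alpha, \beta]$; the values of $\alpha$, $\beta$, and $A$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 0.5, 4.0, 6

# Assumed test instance: f(x) = 0.5 * x^T A x with the eigenvalues of A
# in [alpha, beta], so f is alpha-strongly convex and beta-smooth.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(alpha, beta, size=n)) @ Q.T

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    g = A @ x - A @ y                      # gradient difference of f
    lhs = (x - y) @ g
    rhs = (alpha * beta / (alpha + beta)) * np.sum((x - y) ** 2) \
        + (1.0 / (alpha + beta)) * np.sum(g ** 2)
    assert lhs >= rhs - 1e-9               # the co-coercivity bound above
print("inequality verified on all samples")
```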
Combining this inequality with the above we have:
$$
\|Tx - Ty \|^{2} \leq \|x - y\|^{2} + \lambda^{2}\|\nabla f(x) - \nabla f(y)\|^{2} - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\|x - y\|^{2} + \frac{1}{\alpha + \beta}\|\nabla f(x) - \nabla f(y)\|^{2}\right)
$$
Simplifying we have:
$$
\|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2} - \frac{2\lambda}{\alpha + \beta}\right)\|\nabla f(x) - \nabla f(y)\|^{2}
$$
Note that setting $\lambda = \frac{2}{\alpha + \beta}$ recovers the convergence rate for gradient descent here. How do I proceed from this point? One idea I have is to ignore the last term by ensuring
$$
\lambda^{2} - \frac{2\lambda}{\alpha + \beta} \leq 0 \implies \lambda \leq \frac{2}{\alpha + \beta}
$$
This gives the inequality:
$$
\|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2}
$$
EDIT: I guess we also need to ensure that
$$
\left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right) \geq 0 \implies \lambda \leq \frac{\alpha + \beta}{2\alpha\beta}
$$
But then I am unsure how to proceed from here. Perhaps there is a way I can use $\beta$-smoothness to bound the $\|\nabla f(x) - \nabla f(y)\|$ term instead?
Best Answer
Okay, the proof is actually simple from this point. Restrict the step size to $\lambda \in (0, \frac{2}{\beta})$; this range is chosen because at $\lambda = \frac{2}{\beta}$ the map is merely non-expansive (this is easy to verify).
Recall the bound we had earlier:
$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2} - \frac{2\lambda}{\alpha + \beta}\right)\|\nabla f(x) - \nabla f(y)\|^{2} $$
When $\lambda \leq \frac{2}{\alpha +\beta}$, the coefficient of the second term on the right-hand side is nonpositive, so replacing $\|\nabla f(x) - \nabla f(y)\|$ with a lower bound can only increase the right-hand side. We will use the lower bound that follows directly from $\alpha$-strong convexity:
$$ \|\nabla f(x) - \nabla f(y)\| \geq \alpha \|x - y\| $$
Plugging this in then gives:
$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\alpha^{2} - \frac{2\alpha^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2} $$
Rearranging and simplifying then yields:
$$ \|Tx - Ty\|^{2} \leq (\lambda^{2}\alpha^{2} - 2\lambda\alpha + 1)\|x-y\|^{2} $$
Now factorising we have
$$ \|Tx - Ty\|^{2} \leq (1- \lambda\alpha)^{2}\|x-y\|^{2} \implies \|Tx - Ty\| \leq (1- \lambda\alpha)\|x - y\|, $$
where taking square roots is valid since $1 - \lambda\alpha \geq 0$; this holds because $\lambda \leq \frac{2}{\alpha + \beta} \leq \frac{1}{\alpha}$ (as $\alpha \leq \beta$).
Note that when $\lambda \leq \frac{2}{\alpha + \beta}$, we have $1 - \lambda\alpha = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$: indeed, $1 - \lambda\alpha \geq 1 - \lambda\beta$ since $\alpha \leq \beta$, and $1 - \lambda\alpha \geq \lambda\beta - 1$ since $\lambda(\alpha + \beta) \leq 2$.
Thus we have proved that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda \leq \frac{2}{\alpha+\beta}$. For $\lambda \geq \frac{2}{\alpha +\beta}$, note that the second term on the RHS of our inequality is nonnegative, and thus we can replace $\|\nabla f(x) - \nabla f(y)\|$ with an upper bound using $\beta$-smoothness:
$$ \|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\| $$
Plugging this in yields:
$$ \|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\beta^{2} - \frac{2\beta^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2} $$
Simplifying and factorising as before we find that:
$$ \|Tx - Ty\|^{2} \leq (\lambda\beta - 1)^{2}\|x-y\|^{2} \implies \|Tx - Ty\| \leq (\lambda\beta - 1)\|x - y\|, $$
where taking square roots is valid since $\lambda\beta - 1 \geq 0$; this holds because $\lambda \geq \frac{2}{\alpha + \beta} \geq \frac{1}{\beta}$.
Similarly to before, when $\lambda \geq \frac{2}{\alpha + \beta}$, we see that $\lambda\beta - 1 = \max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$. Thus we have shown that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda$ such that $\frac{2}{\beta} > \lambda \geq \frac{2}{\alpha + \beta}$.
Combining both cases ($\lambda \geq \frac{2}{\alpha+\beta}$, $\lambda \leq \frac{2}{\alpha +\beta}$) we see that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda \in \left(0, \frac{2}{\beta}\right)$, i.e.
$$ \|Tx - Ty\| \leq \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}\|x - y\| $$
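The final contraction factor is easy to check numerically. The following sketch (again on an assumed quadratic instance $f(x) = \frac{1}{2}x^{\top}Ax$ with spectrum in $[\alpha, \beta]$, chosen purely for illustration) sweeps step sizes across $(0, \frac{2}{\beta})$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n = 0.5, 4.0, 6

# Assumed test instance: f(x) = 0.5 * x^T A x with the eigenvalues of A
# in [alpha, beta], so f is alpha-strongly convex and beta-smooth.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(alpha, beta, size=n)) @ Q.T

for lam in np.linspace(0.01, 2.0 / beta - 0.01, 20):
    rate = max(abs(1 - lam * alpha), abs(1 - lam * beta))
    for _ in range(200):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        Tx = x - lam * (A @ x)             # one gradient-descent step from x
        Ty = y - lam * (A @ y)             # one gradient-descent step from y
        assert np.linalg.norm(Tx - Ty) <= rate * np.linalg.norm(x - y) + 1e-9
print("contraction factor verified for all step sizes")
```

Note that every `rate` in the sweep is strictly below 1, consistent with $T$ being a contraction on the whole interval $(0, \frac{2}{\beta})$.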
It is now also immediately obvious why $\frac{2}{\beta}$ is an upper bound on the step size. For $\lambda \geq \frac{2}{\alpha + \beta}$ we have
$$ \lambda\beta - 1 = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\} $$
But
$$ \lambda\beta - 1 < 1 \iff \lambda < \frac{2}{\beta} $$
So bounding $\lambda$ above by $\frac{2}{\beta}$ ensures $\max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ is less than 1.