Objective function decreases monotonically under gradient descent if it is strongly convex

convex optimization, gradient descent, numerical optimization, optimization

I'm trying to prove the following:


Suppose $f$ is strongly convex, meaning that for some $m > 0$ we have

$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2} \|y - x\|^2$$

then the gradient descent procedure

$$x^{t + 1} = x^t - \eta \nabla f(x^t)$$

satisfies $f(x^{t + 1}) \leq f(x^t)$ for all $\eta$ sufficiently small.


This seems pretty straightforward, but if I set $y = x^t$ and $x = x^{t + 1}$ in the definition of strong convexity, I get

$$f(x^t) - f(x^{t + 1}) \geq \eta \nabla f(x^{t + 1})^T \nabla f(x^t) + \frac{m}{2} \eta^2 \|\nabla f(x^t)\|^2$$

But now it's not clear to me that the right-hand side is non-negative for small $\eta$, since I can't see what we can say about

$$\nabla f(x^{t + 1})^T \nabla f(x^t)$$
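
For what it's worth, a quick numerical check on a toy strongly convex quadratic (my own example, chosen only for illustration) suggests that for small $\eta$ this inner product just tracks $\|\nabla f(x^t)\|^2$, but I don't see how to turn that observation into a proof:

```python
import numpy as np

# Toy strongly convex quadratic, chosen only for illustration:
# f(x) = 0.5 * x^T A x with A symmetric positive definite, so grad f(x) = A x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x_t = np.array([1.0, -2.0])
g_t = grad(x_t)

for eta in [1.0, 0.5, 0.1, 0.01]:
    x_next = x_t - eta * g_t            # one gradient descent step
    cross = grad(x_next) @ g_t          # the inner product in question
    print(f"eta={eta:5.2f}  decrease={f(x_t) - f(x_next):8.4f}  "
          f"cross={cross:8.4f}  ||grad f(x^t)||^2={g_t @ g_t:.4f}")
```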

Any ideas would be appreciated. Thanks!

Best Answer

I am assuming that $\nabla f$ is locally Lipschitz.

If $f$ is strongly convex, then the level sets $L_\alpha = \{ x \mid f(x) \le \alpha \}$ are compact. Note that if the values $f(x_k)$ are non-increasing, then $x_k \in L_{f(x_0)}$ for all $k$.

Strong convexity plays no role in showing that a fixed step size works, beyond guaranteeing compact level sets. The reason a small fixed step size could fail is too much curvature, not too little. (A lower curvature bound is useful for producing a convergence rate.)

Pick some $x_0$, let $\Lambda = \{ x \mid \|x - y\| \le 1 \text{ for some } y \in L_{f(x_0)} \}$, and let $M = \sup_{x \in \Lambda} \| \nabla f(x) \|$. The $1$ in the definition of $\Lambda$ is just a bit of slop so that the estimates below work for a sufficiently small step size.

Let $L$ be a bound on the Lipschitz constant of $\nabla f$ on $\Lambda$.

If $x, x+h \in \Lambda$, then
$$f(x+h) = f(x) + \langle \nabla f(x), h \rangle + \Big\langle \int_0^1 \big(\nabla f(x+th) - \nabla f(x)\big)\, dt ,\ h \Big\rangle \le f(x) + \langle \nabla f(x), h \rangle + \frac{L}{2}\|h\|^2,$$
where the last inequality uses $\|\nabla f(x+th) - \nabla f(x)\| \le L t \|h\|$ (the segment from $x$ to $x+h$ stays in $\Lambda$, since $\Lambda$ is convex) and $\int_0^1 t \, dt = \frac{1}{2}$.
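
As a quick sanity check on this quadratic upper bound, here is a sketch on a toy quadratic of my own choosing, for which $\nabla f$ is globally Lipschitz with constant equal to the largest eigenvalue of $A$:

```python
import numpy as np

# Toy example (my choice): f(x) = 0.5 * x^T A x, grad f(x) = A x, which is
# globally Lipschitz with constant L = largest eigenvalue of A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=2)
    h = rng.normal(size=2)
    lhs = f(x + h)
    rhs = f(x) + grad(x) @ h + 0.5 * L * (h @ h)
    assert lhs <= rhs + 1e-12, (lhs, rhs)
print("f(x+h) <= f(x) + <grad f(x), h> + (L/2)||h||^2 held on all samples")
```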

Now let $h = -\lambda \nabla f(x)$ for $x \in L_{f(x_0)}$ and assume that $0 \le \lambda \le \frac{1}{M}$, so that $\|h\| \le \lambda M \le 1$ and hence $x + h \in \Lambda$. Then
$$f(x - \lambda \nabla f(x)) \le f(x) - \lambda \|\nabla f(x)\|^2 + \frac{\lambda^2 L}{2} \|\nabla f(x)\|^2 = f(x) - \lambda \|\nabla f(x)\|^2 \Big(1 - \frac{\lambda L}{2}\Big),$$
and hence, as long as $0 \le \lambda \le \min\big(\frac{1}{M}, \frac{2}{L}\big)$, we have $f(x - \lambda \nabla f(x)) \le f(x)$. In particular, starting from $x_0$ and iterating $x_{k+1} = x_k - \lambda \nabla f(x_k)$ with such a $\lambda$, each step gives $f(x_{k+1}) \le f(x_k)$, so by induction every iterate stays in $L_{f(x_0)}$ and the bound applies at every step.
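
Putting the pieces together, here is a minimal sketch of the resulting iteration on the same toy quadratic; $M$ is replaced by an easy upper bound, which only shrinks the admissible step range:

```python
import numpy as np

# Same toy quadratic as above (illustration only).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigs = np.linalg.eigvalsh(A)
L = eigs.max()                      # Lipschitz constant of grad f

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x0 = np.array([4.0, -3.0])

# Upper bound on M = sup_{x in Lambda} ||grad f(x)||: the level set {f <= f(x0)}
# lies in the ball ||x|| <= R, so Lambda lies in ||x|| <= R + 1 and ||A x|| <= L (R + 1).
R = np.sqrt(2 * f(x0) / eigs.min())
M = L * (R + 1)

lam = 0.9 * min(1.0 / M, 2.0 / L)   # fixed step size inside [0, min(1/M, 2/L)]
x, values = x0, [f(x0)]
for _ in range(200):
    x = x - lam * grad(x)
    values.append(f(x))

assert all(b <= a + 1e-12 for a, b in zip(values, values[1:])), "f increased somewhere"
print(f"lambda = {lam:.4f}: f decreased monotonically from {values[0]:.4f} to {values[-1]:.6f}")
```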