Choose the constant of strong convexity

convex-optimization, convex-analysis, gradient-descent, numerical-methods, optimization

Knowing or estimating the strong convexity parameter $m$ is extremely important when studying convergence rates. Different values of $m$ produce different rates, but I am puzzled because there seem to be multiple values of $m$ that can be chosen.

Recall that a function $f: \mathbb{R}^n \to \mathbb{R}$ is strongly convex if there exists $m>0$ such that for all $x,y\in \mathbb{R}^n$ and $t\in[0,1]$,
$$f(tx+(1-t)y)\leq tf(x)+(1-t)f(y)-\frac{m}{2}t(1-t)\|x-y\|^2.$$

Note that in the definition, $m$ is just some positive constant.

Whenever $f$ is differentiable, the definition implies

$$(x-y)\cdot (\nabla f(x)-\nabla f(y))\geq m||x-y||^2$$
$$f(y)\geq f(x)+\nabla f(x)\cdot(y-x)+\frac{m}{2}||x-y||^2$$
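
(For reference, both implications follow from the definition by a standard argument: rearranging and dividing by $t$ gives
$$\frac{f(y+t(x-y))-f(y)}{t}\leq f(x)-f(y)-\frac{m}{2}(1-t)\|x-y\|^2,$$
and letting $t\to 0^+$ yields the second inequality with the roles of $x$ and $y$ exchanged; summing the two versions of that bound, with $x$ and $y$ interchanged, gives the first.)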

My question is, does $m$ matter when we say that a function is $m$-strongly convex? Does any $m$ work or does the smallest/largest one work?

As concrete examples,

  1. Is $\frac{1}{2}\langle x, x\rangle$ $1$-strongly convex, or is it $m$-strongly convex for any $m \geq 1$? If the latter, then why do so many people say it is $1$-strongly convex?

  2. Say that a function satisfies $(x-y)\cdot (\nabla f(x)-\nabla f(y)) > 0$. There is presumably some constant $m$, perhaps extremely small, that makes the inequality $(x-y)\cdot (\nabla f(x)-\nabla f(y)) \geq m\|x-y\|^2$ true. In this case we don't really know what $m$ is, so which parameter should we use to estimate the rate of convergence for algorithms such as gradient descent, whose convergence rate depends on $m$?

See: https://blogs.princeton.edu/imabandit/2013/04/04/orf523-strong-convexity/

Best Answer

For concrete example 1:

The function $f(x) = \frac{1}{2}||x||^{2}$ is $m$-strongly convex for any $m\in (0, 1]$. We say it is $1$-strongly convex because $1$ is the largest $m$ for which the inequality $(\nabla f(x)-\nabla f(y))^{T}(x-y) \geq m ||x-y||^{2}$ holds.
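
To see this concretely: $\nabla f(x) = x$, so
$$(\nabla f(x)-\nabla f(y))^{T}(x-y) = \|x-y\|^{2},$$
which is $\geq m\|x-y\|^{2}$ for every $x,y$ exactly when $m \leq 1$ (with equality at $m=1$). Hence $m=1$ is the sharp constant.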

For concrete example 2:

If $(\nabla f(x)-\nabla f(y))^{T}(x-y) > 0$, it is not necessarily true that there is "some constant factor $m$, extremely small, that makes the inequality true". For example, $f(x) = \frac{1}{4}x^{4}$ is convex, but there is no $m>0$ for which $(\nabla f(x)-\nabla f(y))^{T}(x-y) \geq m ||x-y||^{2}$. Indeed, if such an $m$ existed we would have $(x^3-y^3)(x-y) \geq m ||x-y||^{2}$, and setting $y=0$ gives $x^4 \geq m x^2$, i.e., $x^2(x^2-m) \geq 0$. But when $0 < x < \sqrt{m}$ we have $x^2(x^2-m)<0$, so for every $m>0$ there is an $x$ small enough that the inequality fails.
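
A quick numerical check of this (my own Python illustration, not part of the original answer): with $y=0$, the ratio $\frac{(f'(x)-f'(0))(x-0)}{|x-0|^{2}} = x^{2}$ falls below any fixed $m>0$ once $x$ is small enough.

```python
import numpy as np

def grad_f(x):
    return x ** 3          # gradient of f(x) = x^4 / 4

# For y = 0 the strong-monotonicity ratio is
#   (grad_f(x) - grad_f(0)) * (x - 0) / |x - 0|^2 = x^2,
# which drops below any fixed m > 0 for x close enough to the origin.
for m in (1e-2, 1e-4, 1e-6):
    xs = np.logspace(0, -6, 13)                       # x values shrinking toward 0
    ratios = (grad_f(xs) - grad_f(0.0)) * xs / xs ** 2
    violated = xs[ratios < m]                         # where m-strong convexity fails
    print(f"m = {m:g}: ratio < m for all sampled x <= {violated.max():.1e}")
```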

Question: does $m$ matter when we say that a function is $m$-strongly convex?

The reason that $m$ matters is that algorithms such as gradient descent will not converge if the step size $\alpha_{k}$ is not selected appropriately (here $k$ indexes the iterations). There are many different step size rules depending on the assumptions made on $f$. You can use diminishing step sizes $\alpha_{k} = \frac{1}{k}$, but as the algorithm runs longer the steps get smaller and smaller, i.e., progress slows down. The algorithm converges faster if we can use a constant step size $\alpha_k = \alpha$.

For $m$-strongly convex functions whose gradient is $L$-Lipschitz, gradient descent converges for any constant $\alpha \in (0,\frac{2}{L})$; see page 15 (pdf page 13) of https://stanford.edu/~boyd/papers/pdf/monotone_primer.pdf. The optimal constant step size is $\alpha = \frac{2}{m+L}$, which yields a contraction factor of $\frac{L-m}{L+m}$ per iteration. This is why $m$ is important: the optimal step size and the resulting convergence rate depend directly on $m$. If we can only certify a smaller value of $m$, the guaranteed contraction factor $\frac{L-m}{L+m}$ is closer to $1$, so the convergence guarantee is weaker.

Additionally, depending on the algorithm or type of problem, the admissible range of constant step sizes may itself depend on $m$. For example, using the gradient (forward step) method to find a zero of an $m$-strongly monotone, $L$-Lipschitz operator requires $\alpha \in (0, \frac{2m}{L^{2}})$; see page 16 (pdf page 14) of https://stanford.edu/~boyd/papers/pdf/monotone_primer.pdf.
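
A minimal sketch of this effect (my own Python illustration, not taken from the linked notes), assuming a diagonal quadratic objective so that $m$ and $L$ are simply the extreme eigenvalues:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A diagonal; the strong convexity constant m and the
# gradient's Lipschitz constant L are the smallest and largest eigenvalues of A.
eigs = np.linspace(1.0, 10.0, 5)              # so m = 1, L = 10
A = np.diag(eigs)
m, L = eigs.min(), eigs.max()

def gradient_descent(alpha, iters=200):
    x = np.ones(len(eigs))                    # arbitrary starting point
    for _ in range(iters):
        x = x - alpha * (A @ x)               # grad f(x) = A x
    return np.linalg.norm(x)                  # distance to the minimizer x* = 0

print("alpha = 2/(m+L):                  ", gradient_descent(2.0 / (m + L)))
print("alpha from underestimated m = 0.1:", gradient_descent(2.0 / (0.1 + L)))
```

With the true $m$, the optimal step contracts every coordinate by at most $\frac{L-m}{L+m}\approx 0.82$ per iteration; the step computed from the underestimate $m=0.1$ leaves the stiffest direction contracting by only about $0.98$, so after the same number of iterations the error is many orders of magnitude larger.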
