Practical step-size selection for Entropic Mirror Descent


I am considering using Entropic Mirror Descent (MD) for minimization of $f(x)$ over the probability simplex:

$$x\succcurlyeq 0,\quad \mathbf 1^\mathrm T x=1.$$

My pick is MD, partly because it fits the simplex geometry (which makes it conceptually appealing), and partly because my problem is high-dimensional, $n \approx 10000$, so I'd expect MD's convergence rate, which scales like $\sqrt{\log n}$ in the dimension, to make a difference relative to, say, projected gradient.
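For concreteness, below is a minimal sketch of the entropic MD (exponentiated gradient) update I have in mind; the function names, the made-up quadratic example, and the simple $1/\sqrt{k+1}$ schedule are placeholders of mine, not a recommendation.

```python
import numpy as np

def entropic_md(grad_f, x0, step, n_iters=1000):
    """Entropic mirror descent (exponentiated gradient) over the probability simplex.

    grad_f : callable returning the gradient of f at x
    x0     : strictly positive starting point with sum(x0) == 1
    step   : callable k -> step size t_k
    """
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)
        # Multiplicative update; shifting g by its minimum only rescales the
        # normalization constant and keeps exp() from overflowing.
        w = x * np.exp(-step(k) * (g - g.min()))
        x = w / w.sum()
    return x

# Illustrative use: f(x) = 0.5 * ||A x - b||^2 over the simplex (A, b made up)
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 200)), rng.standard_normal(50)
x = entropic_md(grad_f=lambda x: A.T @ (A @ x - b),
                x0=np.full(200, 1.0 / 200),
                step=lambda k: 0.1 / np.sqrt(k + 1))
```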

I have two practical questions regarding its use:

  1. What is a good way to select a step size? Most literature seems to point to theoretical results that depend on the Lipschitz constant and the strong convexity constant of the objective, which are not easy to compute for an arbitrary function. Is practical step selection the same as in standard gradient descent, e.g., backtracking line search?
  2. From a practical standpoint again, is it better to use something like the augmented Lagrangian to solve a problem with such constraints? "Better" in the sense of having more well-studied algorithms or better implementations available. An additional concern in that case is that it adds $O(n)$ slack variables: does this imply that MD is still probably the better choice?

Grateful for any pointers, thanks!

Best Answer

In Beck's *First-Order Methods in Optimization*, it is stated regarding MD that $$\text{if}\quad \frac{\sum_{n=0}^k t_n^2}{\sum_{n=0}^k t_n} \to 0 \quad \text{as} \quad k\to\infty, \quad \text{then} \quad f_{\mathrm{best}}^k\to f_{\mathrm{opt}}.$$ However, there is no complexity estimate for such a step-size strategy; all we know is that the algorithm eventually converges to the optimal value. In other words, for MD you cannot entirely avoid some knowledge of the smoothness or strong convexity parameter of your function.
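Concretely, a diminishing schedule such as $t_k = c/\sqrt{k+1}$ satisfies this condition, since $\sum_{n=0}^k t_n^2$ grows like $\log k$ while $\sum_{n=0}^k t_n$ grows like $\sqrt{k}$. A quick illustrative check (the schedule here is just an example, not a tuned choice):

```python
import numpy as np

k = np.arange(1, 100001)
t = 1.0 / np.sqrt(k)                    # t_k = 1/sqrt(k), a standard diminishing schedule
ratio = np.cumsum(t**2) / np.cumsum(t)  # the quantity that must vanish
print(ratio[[99, 9999, 99999]])         # roughly 0.28, 0.05, 0.02 -> tends to 0
```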

You are correct that in a non-Euclidean setting (the simplex), MD is a better choice than projected gradient, but there is an even better choice than both of them, which also beats the augmented Lagrangian and can be used with a backtracking procedure without prior knowledge of smoothness / strong convexity.

FISTA is an accelerated proximal gradient method that is faster than MD/PG. Its complexity is $O(\frac{1}{k^2})$, which is optimal for first-order methods. Moreover, the per-iteration cost is the same as MD/PG, which is usually not the case for the augmented Lagrangian. The backtracking procedure is not very different from that of prox-grad (PG is a special case of prox-grad), and all the details can be found here: https://web.iem.technion.ac.il/images/user-files/becka/papers/71654.pdf
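For illustration, here is a minimal sketch of FISTA with backtracking specialized to your problem: since the nonsmooth part is the indicator of the simplex, the prox step is just the Euclidean projection onto the simplex. The function names and default parameters are mine, not from the paper, and this assumes $f$ is smooth with a callable gradient.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fista_simplex(f, grad_f, x0, L0=1.0, eta=2.0, n_iters=500):
    """FISTA with backtracking for min f(x) s.t. x in the probability simplex.

    f, grad_f : smooth objective and its gradient
    L0, eta   : initial Lipschitz estimate and backtracking growth factor
    """
    x = project_simplex(x0)
    y, t, L = x.copy(), 1.0, L0
    for _ in range(n_iters):
        g = grad_f(y)
        # Backtracking: grow L until the quadratic upper bound holds at the prox point.
        while True:
            x_new = project_simplex(y - g / L)
            d = x_new - y
            if f(x_new) <= f(y) + g @ d + 0.5 * L * (d @ d):
                break
            L *= eta
        # Standard FISTA momentum update.
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```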
