[Math] Motivation for Mirror-Descent

convergence-divergence, duality-theorems, numerical-optimization, optimization

I am trying to wrap my head around why mirror descent is such a popular optimization algorithm. Based on my reading, it seems like the main reason is that it improves upon the convergence rate of subgradient descent, while only requiring the gradient of a strongly convex function $\psi$ (the function generating the Bregman divergence).

Additionally, there seem to be hints that, depending on the choice of $\psi$, you might also gain decomposition advantages.

So my questions are:

1) If the full gradient information of the objective function is used, does mirror descent have any convergence advantage over gradient descent?

2) What is an example where using $\psi \neq \|\cdot\|_2^2$ is advantageous?

3) What exactly is the connection of mirror descent with duality, e.g. what interpretation motivates this method?

Thank you!

Best Answer

The advantage of using mirror descent over gradient descent is that it takes the geometry of the problem into account through the potential function $\Phi$ (the mirror map, playing the role of your $\psi$).

We can see mirror descent as a generalization of projected gradient descent, which is ordinarily based on an assumed Euclidean geometry. Projected gradient descent is the special case of mirror descent with the potential $\Phi = \|\cdot\|_2^2/2$. But sometimes other choices of $\Phi$ better model the geometry of the problem. Examples include the negative entropy $\sum_i x_i\log x_i$ or the log-barrier $-\sum_i \log x_i$, both of which are useful for problems over the positive orthant, the simplex, etc.
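To make this concrete, here is one standard way to write the update (the step size $\eta > 0$ and constraint set $\mathcal{X}$ are generic symbols of my choosing, not from the question):

$$x_{k+1} = \operatorname*{arg\,min}_{x\in\mathcal{X}} \Big\{ \eta\,\langle \nabla f(x_k), x\rangle + D_\Phi(x, x_k) \Big\}, \qquad D_\Phi(x,y) = \Phi(x) - \Phi(y) - \langle\nabla\Phi(y),\, x-y\rangle.$$

With $\Phi = \|\cdot\|_2^2/2$ the Bregman divergence is $D_\Phi(x,y) = \tfrac12\|x-y\|_2^2$, and the update reduces to $x_{k+1} = \Pi_{\mathcal{X}}\big(x_k - \eta\nabla f(x_k)\big)$, i.e. exactly projected gradient descent.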

To answer 1), it depends on the geometry of the problem. In high dimensions (large-scale optimization), it can be advantageous to abandon the Euclidean geometry to improve the constants in our convergence rates. As an example, consider a differentiable convex function $f$ on the simplex $\Delta_n = \{x\in\mathbb{R}^n_+ : \sum_i x_i = 1\}$ with $\|\nabla f(x)\|_{\infty}\leq 1$ for all $x\in \Delta_n$. Then $\|\nabla f(x)\|_2\leq \sqrt{n}$, and projected gradient descent converges at a rate of order $\sqrt{\frac{n}{k}}$. However, using mirror descent with an appropriately chosen $\Phi$ (the negative entropy), we get a rate of order $\sqrt{\frac{\log(n)}{k}}$. When $n$ is quite large, this is a huge improvement. The key point is that our choice of potential $\Phi$ determines the new geometry.
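A rough sketch of where these rates come from (this is the standard mirror descent guarantee, e.g. as in Bubeck's monograph, with constants suppressed; it is not something stated in the question): if $\Phi$ is $1$-strongly convex with respect to a norm $\|\cdot\|$, the gradients satisfy $\|\nabla f(x)\|_* \le L$ in the dual norm, and $R^2 \ge \sup_{x\in\mathcal{X}} D_\Phi(x,x_1)$, then with a suitable fixed step size

$$f\Big(\tfrac{1}{k}\sum_{s=1}^{k} x_s\Big) - \min_{x\in\mathcal{X}} f(x) \;\le\; R\,L\,\sqrt{\frac{2}{k}}.$$

On the simplex with the negative entropy, $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_1$ (Pinsker's inequality), the dual norm is $\|\cdot\|_\infty$ so $L\le 1$, and $R^2 \le \log n$ when starting from the uniform distribution, giving $\sqrt{\log(n)/k}$. With the Euclidean potential one can only use $L \le \|\nabla f\|_2 \le \sqrt{n}$, hence $\sqrt{n/k}$.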

To answer 2), many times when working with probability distributions the Euclidean distance is a poor fit, and we prefer the so-called Kullback–Leibler divergence to measure the 'distance' between two probability distributions. In such scenarios, mirror descent with $\Phi$ set to be the negative entropy provides better constants in our convergence rates.
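If a computational picture helps, here is a minimal sketch (my own illustration, not from the original post; the function name, step size, and toy objective are all made up) of mirror descent on the probability simplex with the negative-entropy potential, where the update collapses into the familiar exponentiated-gradient / multiplicative form:

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta, steps):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map Phi(x) = sum_i x_i log x_i (exponentiated-gradient update).
    `grad` returns a (sub)gradient of the objective at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        # The dual gradient step and the Bregman projection back onto the
        # simplex combine into a single multiplicative update.
        w = x * np.exp(-eta * g)
        x = w / w.sum()
    return x

# Toy usage: minimize f(x) = <c, x> over the simplex; the minimizer puts
# all mass on the smallest coordinate of c.
c = np.array([0.3, 0.1, 0.6])
x_star = entropic_mirror_descent(lambda x: c, np.ones(3) / 3, eta=0.5, steps=200)
print(x_star)  # concentrates on index 1
```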

To answer 3), we map our primal point $x$ to the dual space (through the mapping $\nabla \Phi$), take a gradient step there, and then map back to the primal space (through $(\nabla\Phi)^{-1}$). For reflexive spaces (i.e. $X^{**} = X$) this is not a big deal; for example, in Hilbert spaces we do not worry so much, since the norm arises from an inner product and the space is already isomorphic to its dual. However, in a more general setting (an arbitrary Banach space), we have to be careful about handling this duality.
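Written out, this 'mirror' picture of the update is (standard notation, with $\Phi^*$ the convex conjugate of $\Phi$ and the last step being the Bregman projection onto the feasible set):

$$\theta_k = \nabla\Phi(x_k), \qquad \theta_{k+1} = \theta_k - \eta\,\nabla f(x_k), \qquad y_{k+1} = \nabla\Phi^*(\theta_{k+1}), \qquad x_{k+1} = \operatorname*{arg\,min}_{x\in\mathcal{X}} D_\Phi(x, y_{k+1}),$$

where $\nabla\Phi^* = (\nabla\Phi)^{-1}$ for a differentiable, strictly convex (Legendre) potential. In the Euclidean case $\nabla\Phi$ is the identity, so the primal and dual steps coincide and this structure is invisible; it is precisely a non-trivial $\nabla\Phi$ that lets the iterates respect a non-Euclidean geometry.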

I know this is a vague answer that doesn't quite answer your questions, but I hope it helps you find the right way.
