Unconstrained Mirror Descent Algorithm

convex optimizationoptimization

I wanted to know what the formula for mirror descent is. (I'm only looking at the unconstrained case) I'm trying to set up mirror descent, I am starting with the fact it can be thought as minimising this
$$
p_k(y) = f(x_k) + \langle\nabla{f(x_{k})},y-x_{k}\rangle + \dfrac{1}{h_k}B_w(y,x_k)
$$
Where $B_w$ is the bregman divergence.

But then I'm not too sure how you get the update formula out from this?

Would it just be

$$
\nabla{w}(x_{k+1}) = \nabla{w}(x_{k}) – \nabla{f}(x_k)
$$

If that is the case how do I get that from the above minimisation problem

Best Answer

Since the Bregman divergence is a convex function, you can get the update formula from the first-order optimality condition:

$$ 0 = \nabla_y \left\{ f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{1}{h_k} \left(w(y) - w(x_k) - \langle \nabla w(x_k), y - x_k \rangle \right) \right\} $$ which leads to the update formula (here $y$ is your $x_{k+1}$):

$$ \nabla w(y) = \nabla w(x_k) - h_k \nabla f(x_k). $$

Related Solutions

[Math] Strong convexity of squared $\ell_p$ norm in Bregman divergence

I see this question is very old, but hopefully someone else finds it and can benefit from an answer.

Answer: No.

First, this clever solution demonstrates the failure when $p=1$ or $\infty$ which was your original interest.

Strict/Strong convexity of non-Euclidean norm

Thank you @daw for pointing out an error in my original computation. The answer is still "no" when $p > 2$ and $n \ge 2$. In a new comment after the (fixed) version of my original answer, I explain why the answer is "yes" when $n=1$, with any norm in place of $\| \cdot \|_{p}$.

Let $F(x) = \frac{1}{2} \|x\|_{p}^{2}$. Assume $2 < p < \infty$ so that $\| \cdot \|_{p}$ (and consequently $F$) is $C^{2}$.

In particular, if $x$ is near enough to $x_{0}$, $$ F(x) = F(x_{0}) + DF(x_{0}) \cdot (x-x_{0}) + \frac{1}{2}(x-x_{0})^{\perp} D^{2}F(x_{0}) \cdot (x-x_{0}) + o(|x-x_{0}|^{2}). $$

So, we compute $$\partial_{i}F(x) = \|x\|_{p}^{2-p} x_{i} |x_{i}|^{p-2}$$ and $$\partial_{ij}F(x) = \delta^{i}_{j} (p-1) \left|\frac{x_{i}}{\|x\|_{p}} \right|^{p-2} - (p-2) \frac{x_{i} |x_{i}|^{p-2}}{\|x\|_{p}^{p-1}} \frac{x_{j} |x_{j}|^{p-2}}{\|x\|_{p}^{p-1}}.$$

Let $e_{1}$ denote the first standard basis vector. Then, $$ \partial_{ij}F(e_{1}) = \begin{cases} 1 & (i,j) = (1,1) \\ 0 & else \end{cases}. $$ In particular, if $n = 2$ (or if $n$ is larger, pad all matrices/vectors below with zeroes) when $x = e_{1} + t e_{2}$ we have $$ (x-x_{0})^{\perp} (D^{2}F)(e_{1})(x-x_{0}) = \begin{pmatrix} 0 & t \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ t \end{pmatrix} = 0. $$ So, using the Taylor series approximation for $t$ small to bound $\Delta_{p}(e_{1}+ te_{2}, e_{1})$, we discover $$ |\Delta_{p}(e_{1}+ t e_{2},e_{1})| \le o(t^{2}). $$

It might be interesting to convert this into the language of convex functions and see if a similar statement holds for $1 < p < 2$.

Added when corrections were done: Note that if $f \in C^{2}(\mathbb{R}^{n} \setminus \{0\})$ is any norm and $F(x) = \frac{1}{2} f(x)^{2}$, then $DF(x) = f(x) Df(x)$ and $D^{2}f(x) = f(x) D^{2}f(x) + Df(x) (Df(x))^{\perp}$.

Since $f$ is one-homogeneous, $Df(x)$ is non-zero (indeed, $Df(x) \cdot x = f(x) \neq 0$ due to non-degeneracy of norms) and $Df(x) ( Df(x))^{\perp}$ is therefore a rank-one, positive semidefinite matrix.

Since $f$ is convex, $f(x) D^{2}f(x)$ is positive semidefinite. In particular, the sum, $D^{2}F(x) = f(x) D^{2}f(x) + Df(x) (Df(x))^{\perp}$ has all non-negative eigenvalues and at least one that is positive. This and the above Taylor series expansion shows why there at least exists $c_{x}$ so that $\Delta_{p}(x,y) \ge c_{x} \|x-y\|^{2}$ for$\|x-y\|$ small when $n=1$, and in fact this holds for any norm in place of $\| \cdot \|_{p}$.

Global Newton Method Algorithm for Optimisation

First there's the concept of taking a quadratic approximation of the objective function. This looks like

$$f(x) \approx f(x_0) + \nabla f(x_0) \cdot (x-x_0) + \frac{1}{2} (x-x_0)^T \nabla^2 f(x_0) (x-x_0).$$

If $\nabla^2 f(x_0)$ is positive definite then the minimum of this function is located at $x=x_0-(\nabla^2 f(x_0))^{-1} \nabla f(x_0)$. (This is basically the multivariate version of the $x=-b/(2a)$ formula for the vertex of a parabola.) One can in principle hope that this is somewhere at least closer to the minimum of the original $f$ than $x_0$ was.

This is just the Newton method for optimization. In the end it is the Newton method for finding zeros just applied to the function $\nabla f$.

This method is doing some other stuff too. First off in step 3 it decides whether to use the Newton search direction or the gradient descent search direction, based on which one looks like it would be better. The parameters $p$ and $q$ determine the threshold for doing this. I don't have much intuition for why that particular scheme with $p$ and $q$ makes sense.

Having picked a direction you line search along that direction. You want to go a long ways so that things have a good chance to improve but you don't want to go too far to the point that things aren't really improving at all. This particular method is asking that the actual decrease in $f$ be not too much worse than it should be based on the directional derivative, where "how much worse is acceptable" is controlled by the parameter $\delta$. Because they impose $\delta \in (0,1)$, this line search will always terminate (albeit perhaps for a problematically small $\alpha$) by the definition of the directional derivative. The original Newton method uses $\alpha=1$ which is often not reasonable if your current guess is not very good.

The last step should be self-explanatory.

Best Answer

Related Solutions

[Math] Strong convexity of squared $\ell_p$ norm in Bregman divergence

Global Newton Method Algorithm for Optimisation

Related Question