Bounds on successive steps of projected gradient descent

convex optimization, gradient descent, machine learning, numerical optimization, optimization

Let $f:\mathbb{R}^n\rightarrow\mathbb{R}$ be a continuously differentiable, $m$-strongly convex function with a globally $L$-Lipschitz continuous gradient. The standard gradient descent method for computing the unconstrained minimizer $x^{\star}$ is
$$
x_{k+1} = x_k-\alpha \nabla f(x_k).
$$

Successive steps satisfy the bound
$$
\begin{aligned}
\|x_{k+1} - x_k\|_2 &= \alpha \|\nabla f(x_k)\|_2 = \alpha \|\nabla f(x_k)-\nabla f(x^{\star})\|_2\\
&\leq \alpha L\|x_{k}-x^{\star}\|_2.
\end{aligned}
$$

where I used $\nabla f(x^{\star}) = 0$. This bound has two features: (i) it is linear in $\|x_k-x^{\star}\|_2$, and (ii) it is linear in $\alpha$. The latter is intuitive: if the step size is small, you don't move much.
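As a quick sanity check (not part of the argument), the bound is easy to verify numerically; here is a small Python sketch on a random strongly convex quadratic, where the test problem and constants are arbitrary choices.

```python
# Sanity check of ||x_{k+1} - x_k||_2 <= alpha * L * ||x_k - x_star||_2 for
# plain gradient descent on f(x) = 0.5 x^T A x - b^T x (arbitrary test problem).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)          # symmetric positive definite Hessian
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f(x) = A x - b
x_star = np.linalg.solve(A, b)   # unconstrained minimizer

alpha = 0.1 / L
x = rng.standard_normal(n)
for _ in range(20):
    x_next = x - alpha * (A @ x - b)
    step = np.linalg.norm(x_next - x)
    bound = alpha * L * np.linalg.norm(x - x_star)
    assert step <= bound + 1e-12
    x = x_next
print("bound held on every iteration")
```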

I am looking for a similar bound for the projected gradient descent method
\begin{align}\tag{1}\label{Eq:PGD}
x_{k+1} = \mathrm{Proj}_{X}(x_{k}-\alpha \nabla f(x_k))
\end{align}

where $X$ is a closed non-empty convex set. I will let $\bar{x} \in X$ denote the constrained optimizer. For me, linearity of the bound in $\|x_{k}-\bar{x}\|_2$ is crucial, and it should have the property that it tends to zero as $\alpha \to 0$ (though I can tolerate lack of linearity there). I am recording my best attempts at this below.
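Before the attempts, here is a minimal numerical sketch of iteration (1) that I use to eyeball the ratio $\|x_{k+1}-x_{k}\|_2/\|x_{k}-\bar{x}\|_2$ as the step size shrinks; the quadratic objective and box constraint are arbitrary choices for illustration, and $\bar{x}$ is approximated by running the iteration to high accuracy.

```python
# Projected gradient descent (1) on a strongly convex quadratic with a box
# constraint (arbitrary illustration), reporting the first-step ratio
# ||x_{k+1} - x_k|| / ||x_k - xbar|| for a few step sizes.
import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)          # Hessian of f(x) = 0.5 x^T A x - b^T x
b = 5.0 * rng.standard_normal(n)
L = np.linalg.eigvalsh(A).max()
lo, hi = -1.0, 1.0               # box X = [lo, hi]^n

def grad(x):
    return A @ x - b

def proj(x):
    return np.clip(x, lo, hi)    # Euclidean projection onto the box

def pgd(x0, alpha, iters):
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - alpha * grad(x))
    return x

xbar = pgd(np.zeros(n), 1.0 / L, 50_000)   # high-accuracy constrained minimizer

for alpha in [1.0 / L, 0.1 / L, 0.01 / L]:
    x = 0.5 * np.ones(n)
    x_next = proj(x - alpha * grad(x))
    ratio = np.linalg.norm(x_next - x) / np.linalg.norm(x - xbar)
    print(f"alpha * L = {alpha * L:.2f}: first-step ratio = {ratio:.4f}")
```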

Preliminary Inequalities. Let $\mathcal{N}_{X}(x)$ denote the normal cone of $X$ at the point $x \in X$. Recall that
$$
\mathcal{N}_{X}(x) := \{y \in \mathbb{R}^n\,\,:\,\,y^{\sf T}(z-x) \leq 0\quad \forall z \in X\}.
$$

It is well known that $\mathrm{Proj}_{X}$ is the resolvent operator associated with the normal cone, i.e., $\mathrm{Proj}_{X}(x) = (\mathrm{Id}+\mathcal{N}_{X})^{-1}(x)$. With this, the PGD update law (\ref{Eq:PGD}) and the optimality condition can be equivalently written as
$$
\begin{aligned}
x_{k} - x_{k+1} - \alpha \nabla f(x_k) &\in \mathcal{N}_{X}(x_{k+1})\\
-\alpha \nabla f(\bar{x}) &\in \mathcal{N}_{X}(\bar{x})
\end{aligned}
$$

Expressed via the definition of the normal cone, these inclusions say precisely that
$$
\begin{aligned}
(x_{k}-x_{k+1}-\alpha \nabla f(x_k))^{\sf T}(z-x_{k+1}) &\leq 0, \quad z \in X\\
-\alpha \nabla f(\bar{x})^{\sf T}(z-\bar{x}) &\leq 0, \quad z \in X
\end{aligned}
$$

In particular, the first holds at $z = x_k$ and $z = \bar{x}$, which quickly yields the two inequalities
\begin{align}\label{Eq:FNE1-a}\tag{2}
\|x_{k+1}-x_{k}\|_2^2 &\leq -\alpha (x_{k+1}-x_{k})^{\sf T}\nabla f(x_k)\\
\label{Eq:FNE1-b}\tag{3}
(x_{k+1}-x_{k})^{\sf T}(x_{k+1}-\bar{x}) &\leq -\alpha (x_{k+1}-\bar{x})^{\sf T}\nabla f(x_k).
\end{align}

A simple identity to re-express the LHS of (\ref{Eq:FNE1-b}) is
\begin{align}\label{Eq:FNE1-c}\tag{4}
(x_{k+1}-x_{k})^{\sf T}(x_{k+1}-\bar{x}) = \tfrac{1}{2}\|x_{k+1}-x_{k}\|_2^2 - \tfrac{1}{2}\left(\|x_{k}-\bar{x}\|_2^2 - \|x_{k+1}-\bar{x}\|_2^2\right)
\end{align}

These two inequalities are in fact equivalent to what one can obtain by applying firm non-expansiveness of the projection operator, which yields the alternative expressions
\begin{equation}\label{Eq:FNE2}
\begin{aligned}
\|x_{k+1}-x_{k}\|_2^2 & \leq \alpha^2 \|\nabla f(x_k)\|_2^2 - \|x_{k} - x_{k+1} - \alpha \nabla f(x_k)\|_2^2\\
\|x_{k+1}-\bar{x}\|_2^2 &\leq \|x_{k}-\alpha \nabla f(x_k) - \bar{x}\|_2^2 - \|x_{k} - x_{k+1} - \alpha \nabla f(x_k)\|_2^2.
\end{aligned}
\end{equation}

Similarly, evaluating the optimality condition at $z = x_k$ and $z = x_{k+1}$, we obtain
\begin{align}\label{Eq:Optimality-a}\tag{5}
\alpha (x_{k}-\bar{x})^{\sf T}\nabla f(\bar{x}) &\geq 0\\
\label{Eq:Optimality-b}\tag{6}
\alpha (x_{k+1}-\bar{x})^{\sf T}\nabla f(\bar{x}) &\geq 0.
\end{align}

Attempt #1 — Simple Upper Bound. Using non-expansiveness of the projection and $x_k = \mathrm{Proj}_{X}(x_k)$, one can compute
$$
\begin{aligned}
\|x_{k+1}-x_{k}\|_2 &= \|\mathrm{Proj}_{X}(x_{k}-\alpha \nabla f(x_k)) - \mathrm{Proj}_{X}(x_k)\|_2\\
&\leq \alpha \|\nabla f(x_k)\|_2\\
&= \alpha \|\nabla f(x_k)-\nabla f(\bar{x}) + \nabla f(\bar{x})\|_2\\
&\leq \alpha L \|x_k-\bar{x}\|_2 + \alpha \|\nabla f(\bar{x})\|_2
\end{aligned}
$$

This resembles the unconstrained bound, but with an additional constant term quantifying how active the gradient of $f$ is at the constrained optimizer. The bound, however, does not force $\|x_{k+1}-x_{k}\|_2$ to go to zero as $x_k \to \bar{x}$, which is what should happen.
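To see the failure mode concretely, here is a small one-dimensional illustration (an arbitrary example): with $f(x)=\tfrac{1}{2}x^2$ and $X=\{x \geq 1\}$ we have $\bar{x}=1$ and $\nabla f(\bar{x}) = 1$, so the bound stalls near $\alpha$ while the true step length vanishes as $x_k \to \bar{x}$.

```python
# Attempt #1 bound vs. the true step length for f(x) = 0.5*x^2, X = {x >= 1},
# where xbar = 1 and grad f(xbar) = 1 (arbitrary 1-D illustration, m = L = 1).
L = 1.0
alpha = 0.1
xbar = 1.0

def proj(x):
    return max(x, 1.0)                      # projection onto X = {x >= 1}

for dist in [1.0, 0.1, 0.01, 0.001]:
    xk = xbar + dist
    xk1 = proj(xk - alpha * xk)             # projected gradient step
    actual = abs(xk1 - xk)
    bound = alpha * L * dist + alpha * 1.0  # alpha*L*|x_k - xbar| + alpha*|grad f(xbar)|
    print(f"|x_k - xbar| = {dist:6.3f}: step = {actual:.4f}, bound = {bound:.4f}")
```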

Attempt #2(a). Beginning from the preliminary inequality (\ref{Eq:FNE1-a}), we have
$$
\begin{aligned}
\|x_{k+1}-x_{k}\|_2^2 &\leq -\alpha (x_{k+1}-x_{k})^{\sf T}\nabla f(x_k)\\
&= -\alpha (x_{k+1}-\bar{x}+\bar{x}-x_{k})^{\sf T}\nabla f(x_k)\\
&\leq \alpha (\|x_{k+1}-\bar{x}\|_2 + \|x_{k}-\bar{x}\|_2) \|\nabla f(x_k)-\nabla f(\bar{x})+\nabla f(\bar{x})\|_2
\end{aligned}
$$

For small enough step sizes, the forward-backward operator is a contraction, and the iterates satisfy
$$
\|x_{k+1}-\bar{x}\|_2 \leq \sqrt{1-\alpha m}\, \|x_k-\bar{x}\|_2 =: c_{\alpha}\|x_k-\bar{x}\|_2.
$$

It therefore follows that
$$
\begin{aligned}
\|x_{k+1}-x_{k}\|_2^2 &\leq \alpha (1+c_{\alpha})\|x_{k}-\bar{x}\|_2 (L\|x_{k}-\bar{x}\|_2 + \|\nabla f(\bar{x})\|_2)\\
&= \alpha L(1+c_{\alpha})\|x_k-\bar{x}\|_2^2 + \alpha (1+c_{\alpha})\|\nabla f(\bar{x})\|_2\|x_{k}-\bar{x}\|_2.
\end{aligned}
$$

Using $1 + c_{\alpha} \leq 2$, we therefore conclude that
$$
\|x_{k+1}-x_{k}\|_2 \leq \sqrt{2\alpha L\|x_k-\bar{x}\|_2^2 + 2\alpha \|\nabla f(\bar{x})\|_2\|x_{k}-\bar{x}\|_2}.
$$

On any ball of radius $r_0$ around the point $\bar{x}$, we can therefore obtain a bound of the form
$$
\|x_{k+1}-x_{k}\|_2 \leq \sqrt{\alpha\|x_{k}-\bar{x}\|_2}\sqrt{2Lr_0 + 2\|\nabla f(\bar{x})\|_2}.
$$

This has the right qualitative behaviour, but it has square-root dependence in $\alpha$ and $\|x_k-\bar{x}\|_2$; for my application of interest, the latter is problematic.
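For what it's worth, a quick numerical check of this ball bound on a one-dimensional example (arbitrary choice, $m = L = 1$) shows it holding, but with exactly the square-root slack just described.

```python
# Attempt #2(a) ball bound vs. the true step length for f(x) = 0.5*x^2 on
# X = {x >= 1}: xbar = 1, |grad f(xbar)| = 1, L = 1, r0 = 1 (arbitrary 1-D example).
import math

L = 1.0
alpha = 0.1
xbar, grad_at_xbar = 1.0, 1.0
r0 = 1.0                                    # ball radius around xbar

for dist in [1.0, 0.1, 0.01]:               # distances |x_k - xbar| <= r0
    xk = xbar + dist
    xk1 = max(1.0, xk - alpha * xk)         # projected gradient step
    step = abs(xk1 - xk)
    bound = math.sqrt(alpha * dist) * math.sqrt(2 * L * r0 + 2 * grad_at_xbar)
    print(f"|x_k - xbar| = {dist:5.2f}: step = {step:.4f}, sqrt-type bound = {bound:.4f}")
```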

Attempt #2(b). If we instead begin from the other preliminary inequality (\ref{Eq:FNE1-b}), we can compute that
$$
\begin{aligned}
2(x_{k+1}-x_{k})^{\sf T}(x_{k+1}-\bar{x}) &\leq - 2\alpha (x_{k+1}-\bar{x})^{\sf T}[\nabla f(x_k)]\\
&= - 2\alpha (x_{k+1} - \bar{x})^{\sf T}[\nabla f(\bar{x}) + \nabla f(x_k) - \nabla f(\bar{x})]\\
&= - 2\alpha(x_{k+1} - \bar{x})^{\sf T}\nabla f(\bar{x}) - 2\alpha (x_{k+1}-x_{k} + x_{k}-\bar{x})^{\sf T}(\nabla f(x_k) - \nabla f(\bar{x}))\\
&= - 2\alpha(x_{k+1} - \bar{x})^{\sf T}\nabla f(\bar{x})\\
&\qquad - 2\alpha (x_{k+1}-x_{k})^{\sf T}(\nabla f(x_k) - \nabla f(\bar{x}))\\
&\qquad - 2\alpha(x_{k}-\bar{x})^{\sf T}(\nabla f(x_k) - \nabla f(\bar{x}))
\end{aligned}
$$

The first term is non-positive by the optimality condition (\ref{Eq:Optimality-b}). The second term we upper bound using the Lipschitz constant of $\nabla f$, and the third term we upper bound using strong convexity of $f$. We quickly obtain
$$
2(x_{k+1}-x_{k})^{\sf T}(x_{k+1}-\bar{x}) \leq -2\alpha m\|x_{k}-\bar{x}\|_2^2 + 2\alpha L \|x_{k+1}-x_{k}\|_2\|x_{k}-\bar{x}\|_2
$$

Using (\ref{Eq:FNE1-c}) to replace the left-hand side, we have
$$
\begin{aligned}
\|x_{k+1}-x_{k}\|_2^2 &\leq \left(\|x_{k}-\bar{x}\|_2^2 - \|x_{k+1}-\bar{x}\|_2^2\right)\\
&\qquad -2\alpha m\|x_{k}-\bar{x}\|_2^2\\
&\qquad + 2\alpha L \|x_{k+1}-x_{k}\|_2\|x_{k}-\bar{x}\|_2
\end{aligned}
$$

Dropping the non-positive term $-\|x_{k+1}-\bar{x}\|_2^2$ on the RHS and solving the resulting quadratic inequality in $\|x_{k+1}-x_{k}\|_2$,
we obtain the bound
$$
\|x_{k+1}-x_{k}\|_2 \leq \left(\alpha L + \sqrt{1-2\alpha m + \alpha^2L^2}\right)\|x_{k}-\bar{x}\|_2
$$

which has nice linear behaviour in $\|x_{k}-\bar{x}\|_2$ but does not go to zero as $\alpha \to 0$.
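Numerically, the coefficient $\alpha L + \sqrt{1-2\alpha m + \alpha^2L^2}$ indeed tends to $1$ rather than $0$ as $\alpha \to 0$; here is a quick check with arbitrary constants $m = 1$, $L = 10$.

```python
# Behaviour of the Attempt #2(b) coefficient alpha*L + sqrt(1 - 2*alpha*m + alpha^2*L^2)
# as alpha -> 0 (constants m = 1, L = 10 chosen arbitrarily): it approaches 1, not 0.
import math

m, L = 1.0, 10.0
for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    coeff = alpha * L + math.sqrt(1.0 - 2.0 * alpha * m + alpha ** 2 * L ** 2)
    print(f"alpha = {alpha:.0e}: coefficient = {coeff:.6f}")
```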

Attempt #3 — Using Averaged Operators. The projection operator is firmly non-expansive, and is therefore $1/2$-averaged. If $\alpha \in (0,2/L)$, the forward-step operator $F(x) = (I-\alpha \nabla f)(x)$ is averaged with parameter $\alpha L/2$, since it can be written as
$$
F(x) = \left(1-\tfrac{\alpha L}{2}\right)x + \tfrac{\alpha L}{2}\left(x-\frac{2}{L}\nabla f(x)\right).
$$

It follows [Bauschke, Prop 4.32] that the forward-backward operator $T(x) = \mathrm{Proj}_{X}(x-\alpha\nabla f(x))$ is averaged with parameter
$$
\gamma = \frac{2}{1+\frac{1}{\max\{\tfrac{1}{2},\tfrac{\alpha L}{2}\}}} = \begin{cases}
2/3 &\text{if} \quad \alpha \in (0,1/L]\\
\frac{2\alpha L}{2+\alpha L} &\text{if} \quad \alpha \in (1/L,2/L)
\end{cases}
$$

Since $T$ is $\gamma$-averaged, it satisfies the inequality [Bauschke, Prop 4.25(iii)]
$$
\|T(x)-T(y)\|_2^2 \leq \|x-y\|_2^2 – \frac{1-\gamma}{\gamma}\|x – T(x) – y + T(y)\|_2^2.
$$

Set $x = x_{k}$, so that $T(x) = x_{k+1}$, and set $y = \bar{x}$, so that $T(y) = \bar{x}$. Then
$$
\|x_{k+1}-\bar{x}\|_2^2 \leq \|x_{k}-\bar{x}\|_2^2 – \epsilon\|x_{k+1}-x_{k}\|_2^2
$$

where
$$
\epsilon = \frac{1-\gamma}{\gamma} = \begin{cases}
1/2 &\text{if} \quad \alpha \in (0,1/L]\\
\frac{2-\alpha L}{2\alpha L} &\text{if} \quad \alpha \in (1/L,2/L).
\end{cases}
$$

For $\alpha \in (0,1/L]$ we therefore obtain
$$
\begin{aligned}
\tfrac{1}{2}\|x_{k+1}-x_{k}\|_2^2 &\leq \|x_{k}-\bar{x}\|_2^2 – \|x_{k+1}-\bar{x}\|_2^2\\
\end{aligned}
$$

Dropping the non-positive term $-\|x_{k+1}-\bar{x}\|_2^2$, we get
$$
\|x_{k+1}-x_{k}\|_2 \leq \sqrt{2}\|x_{k}-\bar{x}\|_2
$$

which slightly improves on Zim's bound, but does not have the right qualitative behaviour as $\alpha \to 0$. Intuitively, the quantity $\|x_{k}-\bar{x}\|_2^2 – \|x_{k+1}-\bar{x}\|_2^2$ should be proportional to $\alpha^2\|x_k-\bar{x}\|_2^2$, but convergence gives me a lower bound on this quantity, not an upper bound.

Simple Lower Bound. Not directly what I'm looking for, but interesting and easy. Applying the reverse triangle inequality, we have
$$
\begin{aligned}
\|x_{k+1}-x_{k}\|_2 &= \|x_{k+1}-\bar{x} - (x_{k}-\bar{x})\|_2\\
&\geq \|x_{k}-\bar{x}\|_2 – \|x_{k+1}-\bar{x}\|_2\\
&\geq (1-\sqrt{1-\alpha m})\|x_{k}-\bar{x}\|_2
\end{aligned}
$$

which, using $1-\sqrt{1-t} \geq t/2$ for $t \in [0,1]$, further implies that
$$
\|x_{k+1}-x_{k}\|_2 \geq \alpha\frac{m}{2}\|x_{k}-\bar{x}\|_2
$$
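A quick numerical sanity check of this lower bound along a projected gradient run (one-dimensional example, arbitrary choice, $m = L = 1$):

```python
# Check of ||x_{k+1} - x_k|| >= (alpha*m/2) * ||x_k - xbar|| along a projected
# gradient run for f(x) = 0.5*x^2 on X = {x >= 1} (arbitrary 1-D example, m = L = 1).
m = 1.0
alpha = 0.1
xbar = 1.0

x = 3.0
for _ in range(100):
    x_next = max(1.0, x - alpha * x)        # projected gradient step
    step = abs(x_next - x)
    assert step >= 0.5 * alpha * m * abs(x - xbar) - 1e-12
    x = x_next
print("lower bound held along the trajectory")
```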

Best Answer

Unfortunately, it is not possible. What you are looking for is a function $g(\alpha,\mu,L)$ that depends only on the Lipschitz constant $L$, the strong convexity constant $\mu$, and the step size $\alpha$, such that $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$ and \begin{align} ||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}|| \end{align} for any convex set $X$.

The issue is that the function $g$ has no "information" about the set $X$ you are projecting onto; it is independent of $X$. Therefore, by choosing the set $X(\alpha)$ as a function of $\alpha$, we can force $g$ to violate the property $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$.

Consider the following simple example: $f(x) = \frac{1}{2}x^{2}$, so $\nabla f(x) = x$. Then $x_{k+1} = P_{X(\alpha)}((1 - \alpha) x_{k})$, where $X(\alpha) = \{x\ |\ x\geq 1-\alpha,\ x\in\mathbb{R}\}$. Notice that the minimizer over the set $X(\alpha)$ is $x^{*} = 1-\alpha$.

Now let us set $x_0 = 1$. Then the condition $||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}||$ at the first iteration reads

\begin{align}
||x_{1}-x_{0}|| &\leq g(\alpha,\mu,L) ||x_{0}-x^{*}|| \\
||P_{X(\alpha)}((1 - \alpha)\cdot 1) - 1|| &\leq g(\alpha,\mu,L) ||1-(1-\alpha)|| \\
||(1 - \alpha) - 1|| &\leq g(\alpha,\mu,L) ||1-(1-\alpha)|| \\
\alpha &\leq g(\alpha,\mu,L)\, \alpha \\
1 &\leq g(\alpha,\mu,L).
\end{align}
This example shows that for any $\alpha \in (0,1)$ you can construct a set $X$ and an initial condition $x_0$ such that $g(\alpha,\mu,L) \geq 1$. Therefore, there is no function $g$ such that $\lim_{\alpha \to 0}g(\alpha, \mu, L)=0$ and
\begin{align}
||x_{k+1}-x_{k}|| \leq g(\alpha,\mu,L) ||x_{k}-x^{*}||
\end{align}
for any convex set $X$.
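For concreteness, here is a small numerical reproduction of this counterexample (just a sketch; the particular step sizes looped over are an arbitrary choice).

```python
# Counterexample: f(x) = 0.5*x^2, X(alpha) = {x >= 1 - alpha}, x_0 = 1, so
# x* = 1 - alpha. The first-step ratio ||x_1 - x_0|| / ||x_0 - x*|| equals 1
# for every alpha, so no set-independent g(alpha, mu, L) -> 0 can bound it.
def proj(x, alpha):
    return max(x, 1.0 - alpha)          # projection onto X(alpha)

for alpha in [0.5, 0.1, 0.01, 0.001]:
    x0 = 1.0
    x_star = 1.0 - alpha                # constrained minimizer on X(alpha)
    x1 = proj((1.0 - alpha) * x0, alpha)
    ratio = abs(x1 - x0) / abs(x0 - x_star)
    print(f"alpha = {alpha:5.3f}: ratio = {ratio:.3f}")
```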
