[Math] Proof of Convergence for the Proximal Point Algorithm

Tags: convex-optimization, convex-analysis, optimization, proximal-operators

I'm trying to come up with a super simple proof of convergence for the proximal point algorithm, which uses the iteration scheme

$x^{i+1} = \mathbf{prox}_{tf}(x^i)$

where $f$ is a closed, convex (but not necessarily smooth) function and

$\mathbf{prox}_{tf}(z) = \arg\min_x f(x) + \frac{1}{2t}\|x-z\|_2^2$.
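For concreteness, here is a minimal numerical sketch of this iteration (my own illustration, not part of the question), using $f(x)=|x|$, whose prox has the closed form $\mathbf{prox}_{tf}(z) = \operatorname{sign}(z)\max(|z|-t,0)$ (soft-thresholding):

```python
# Proximal point iteration x^{i+1} = prox_{tf}(x^i) for f(x) = |x|,
# whose prox is the soft-thresholding operator (closed form, no solver needed).
def prox_abs(z, t):
    """prox_{t|.|}(z) = sign(z) * max(|z| - t, 0)."""
    return (1.0 if z > 0 else -1.0) * max(abs(z) - t, 0.0)

def proximal_point(x0, t=1.0, iters=10):
    x = x0
    for i in range(iters):
        x_next = prox_abs(x, t)
        print(f"i={i:2d}  x={x: .4f}  step={abs(x_next - x):.4f}")
        x = x_next
    return x

proximal_point(x0=5.3, t=1.0)  # iterates 5.3, 4.3, ..., 0.3, 0.0, 0.0, ...
```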

It's clear that if $x^i = x^{i+1}$ then $x^i$ must minimize $f$ (since the optimality condition $0\in \partial f(x^i)$ is satisfied); however, I have a small snafu when it comes to showing that a fixed-point solution will be reached.

I think it's most elegant to use the fact that the proximal operator is firmly nonexpansive, that is, taking $u = \mathbf{prox}_{tf}(x)$ and $v = \mathbf{prox}_{tf}(y)$,

$(u-v)^T(x-y) \geq \|u-v\|_2^2$.

Via Cauchy-Schwarz, we have $\|x-y\|\geq \|u-v\|$. In other words

$\|x^i-x^{i-1}\|\geq \|x^{i+1}-x^i\|$

which is so close, but that non-strict inequality is bothering me! I feel like there's some detail I'm missing: if prox is firmly nonexpansive, I should be able to get a strict inequality somewhere, to show that we can't ever have
$\|x^i-x^{i-1}\|= \|x^{i+1}-x^i\|$.
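Here is a quick numerical sanity check of the firm nonexpansiveness inequality (my own sketch, again using $f(x)=|x|$ and its soft-thresholding prox; the small tolerance is just for floating point):

```python
import random

# Numerically check firm nonexpansiveness of prox_{t|.|} (soft-thresholding):
# (u - v)*(x - y) >= (u - v)^2 for u = prox(x), v = prox(y).
def prox_abs(z, t):
    return (1.0 if z > 0 else -1.0) * max(abs(z) - t, 0.0)

random.seed(0)
t = 0.7
for _ in range(10_000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    u, v = prox_abs(x, t), prox_abs(y, t)
    assert (u - v) * (x - y) >= (u - v) ** 2 - 1e-12  # tolerance for rounding
print("firm nonexpansiveness held on all sampled pairs")
```

Incidentally, for this $f$ the successive step sizes $\|x^{i+1}-x^i\|$ all equal $t$ until the iterate reaches $0$, so equality in $\|x^i-x^{i-1}\|\geq \|x^{i+1}-x^i\|$ really can occur away from a fixed point.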

I'm rereading Eckstein and Bertsekas "On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators" but I'm not quite getting how they got around this problem. Anybody know what I'm missing?

Thank you!

Edit: A better way to frame the overall question is to show that, if $A:\mathbb R^n\to 2^{\mathbb R^n}$ is a firmly nonexpansive operator, i.e. one satisfying

$(u-v)^T(x-y)\geq \|u-v\|_2^2$

for all $x$, $y$, $u\in A(x)$, $v\in A(y)$, then the iteration scheme

$x^{i+1} = A(x^i)$

converges to a fixed point solution.

That way, the solution extends beyond the proximal point method to its splitting variants as well (forward-backward, Douglas-Rachford, etc.)
(This is equivalent to Eckstein/Bertsekas' Theorem 3, but I was hoping for a simpler proof 🙂 )
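For what it's worth, here is a small sketch of this framing (my own, not part of the argument), taking $A$ to be the prox of a convex quadratic $f(x)=\tfrac12\|Mx-b\|^2$, which has a closed form and is firmly nonexpansive; the matrix $M$, vector $b$, and step $t$ are arbitrary choices for illustration:

```python
import numpy as np

# Fixed-point iteration x^{i+1} = A(x^i) where A = prox_{tf} for the convex
# quadratic f(x) = 0.5*||M x - b||^2.  This prox has the closed form
#   A(z) = (I + t M^T M)^{-1} (z + t M^T b)
# and, like any prox of a closed convex function, is firmly nonexpansive.
rng = np.random.default_rng(0)
n, m, t = 5, 8, 1.0
M = rng.standard_normal((m, n))   # arbitrary data for the sketch
b = rng.standard_normal(m)

P = np.linalg.inv(np.eye(n) + t * M.T @ M)   # resolvent matrix
A = lambda z: P @ (z + t * M.T @ b)

x = rng.standard_normal(n)
prev_step = np.inf
for i in range(200):
    x_next = A(x)
    step = np.linalg.norm(x_next - x)
    assert step <= prev_step + 1e-12         # step sizes never increase
    prev_step, x = step, x_next

x_star = np.linalg.lstsq(M, b, rcond=None)[0]   # minimizer of f
print("distance to the minimizer:", np.linalg.norm(x - x_star))
```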

Best Answer

I think you need to assume convexity of $f(x)$.

Given $x_i \in X$, define the iteration: $$ x_{i+1} = \arg\min_{x\in X}\left[f(x) + \frac{1}{2t}||x-x_i||^2\right] $$

Claim 1: Suppose $f(x)$ is a convex function over the convex set $X$ and has a global minimum $x^* \in X$. If $x_{i+1}=x_i$, then $x_{i+1}$ minimizes $f(x)$ over $X$.

Proof: Fix $\theta \in (0,1)$. Then $\theta x^* + (1-\theta) x_i \in X$. Since $x_{i+1}$ minimizes $f(x) + \frac{1}{2t}||x-x_i||^2$ over $X$, its objective value is at most the value at $\theta x^* + (1-\theta) x_i$, so:
\begin{align} &f(x_{i+1}) + \frac{1}{2t}||x_{i+1}-x_i||^2 \\ &\leq f(\theta x^* + (1-\theta)x_i) + \frac{1}{2t}||(\theta x^* + (1-\theta)x_i) - x_i||^2 \\ &= f(\theta x^* + (1-\theta)x_i) + \frac{\theta^2}{2t}||x^*-x_i||^2 \\ &\leq \theta f(x^*) + (1-\theta)f(x_i) + \frac{\theta^2}{2t}||x^*-x_i||^2\\ &= \theta f(x^*) + (1-\theta)f(x_{i+1}) + \frac{\theta^2}{2t}||x^*-x_i||^2 \end{align}
where the second inequality uses convexity of $f(x)$, and the last equality uses the fact that $x_i=x_{i+1}$. Since $x_{i+1}=x_i$, the quadratic term on the left vanishes; subtracting $(1-\theta)f(x_{i+1})$ from both sides and dividing by $\theta>0$ gives: $$ f(x_{i+1}) \leq f(x^*) + \frac{\theta}{2t}||x^*-x_i||^2 $$ This holds for all $\theta \in (0,1)$. Taking the limit as $\theta\rightarrow 0$ gives $f(x_{i+1}) \leq f(x^*)$. $\Box$


Claim 2: If $f(x_i)=f(x^*)$ for some iteration $i$, then $f(x_k)=f(x^*)$ for all iterations $k\geq i$.

Proof: Since $x_{i+1} \in X$ and $x^*$ is a global minimum, and since $x_{i+1}$ minimizes the prox objective (so its objective value is at most the value at the candidate $x_i$), we have: $$ f(x^*) \leq f(x_{i+1}) \leq f(x_{i+1}) + \frac{1}{2t}||x_{i+1}-x_i||^2 \leq f(x_i) + \frac{1}{2t}||x_i-x_i||^2 = f(x_i) = f(x^*) $$ and so $f(x_{i+1})=f(x^*)$. The result follows by induction. $\Box$
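A quick numerical check of this chain of inequalities (which, note, also shows the objective values $f(x_i)$ are nonincreasing even when $f(x_i)\neq f(x^*)$), on the toy example $f(x)=|x|$ with $X=\mathbb{R}$ and soft-thresholding as the prox; this is my own sketch, not part of the argument:

```python
# Check the inequality chain from the proof on f(x) = |x| with X = R:
#   f(x*) <= f(x_{i+1}) <= f(x_{i+1}) + (1/2t)||x_{i+1}-x_i||^2 <= f(x_i)
prox = lambda z, t: (1.0 if z > 0 else -1.0) * max(abs(z) - t, 0.0)
f, f_star = abs, 0.0               # f(x) = |x|, minimized at x* = 0

t, x = 0.5, 4.0
for i in range(15):
    x_next = prox(x, t)
    middle = f(x_next) + (1.0 / (2 * t)) * (x_next - x) ** 2
    assert f_star <= f(x_next) <= middle <= f(x) + 1e-12
    x = x_next
print("objective values are nonincreasing along the iterates")
```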


Now define the error $\Delta_i = f(x_i) - f(x^*)$. Note that $\Delta_i\geq 0$ for all $i$. Assume the set $X$ is compact and let $R = \sup_{u, v \in X} ||u-v||$ be its diameter. To avoid trivialities, assume $R>0$.

Claim 3: Suppose $f(x)$ is a convex function over the convex set $X$ and has a global minimum $x^*\in X$. Then for all $\theta \in [0,1]$ we have: $$ \Delta_{i+1} \leq (1-\theta)\Delta_i + \frac{\theta^2R^2}{2t} $$

Proof: Fix $\theta \in [0,1]$. Then $\theta x^* + (1-\theta)x_i \in X$, and by the same argument as in Claim 1:
\begin{align} &f(x_{i+1}) + \frac{1}{2t}||x_{i+1}-x_i||^2 \\ &\leq f(\theta x^* + (1-\theta)x_i) + \frac{1}{2t}||(\theta x^* + (1-\theta)x_i) - x_i||^2 \\ &= f(\theta x^* + (1-\theta)x_i) + \frac{\theta^2}{2t}||x^*-x_i||^2 \\ &\leq \theta f(x^*) + (1-\theta)f(x_i) + \frac{\theta^2}{2t}||x^*-x_i||^2 \end{align}
Dropping the nonnegative quadratic term on the left, bounding $||x^*-x_i|| \leq R$, and subtracting $f(x^*)$ from both sides gives: $$ \Delta_{i+1} \leq (1-\theta)\Delta_i + \frac{\theta^2R^2}{2t} $$ $\Box$
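Here is a numerical sanity check of the Claim 3 bound on a compact example of my own choosing ($X=[0,1]$, $f(x)=(x-0.9)^2$, so $x^*=0.9$ and $R=1$; the prox over $X$ is the unconstrained 1D minimizer clamped to the interval):

```python
import numpy as np

# Check the Claim 3 bound  Delta_{i+1} <= (1-theta)*Delta_i + theta^2*R^2/(2t)
# on X = [0, 1] with f(x) = (x - 0.9)^2, so x* = 0.9, R = 1 (toy example).
t, R = 0.25, 1.0
f = lambda x: (x - 0.9) ** 2
f_star = 0.0

def prox_step(xi):
    # argmin over [0, 1] of f(x) + (1/2t)(x - xi)^2: the objective is a 1D
    # convex quadratic, so clamp its unconstrained minimizer to the interval.
    x_unc = (2 * t * 0.9 + xi) / (2 * t + 1)
    return min(max(x_unc, 0.0), 1.0)

x = 0.0
thetas = np.linspace(0.0, 1.0, 101)
for i in range(30):
    x_next = prox_step(x)
    d_i, d_next = f(x) - f_star, f(x_next) - f_star
    assert all(d_next <= (1 - th) * d_i + th ** 2 * R ** 2 / (2 * t) + 1e-12
               for th in thetas)
    x = x_next
print("Claim 3 bound held for every sampled theta at every iteration")
```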

You can optimize the bound $(1-\theta)\Delta_i + \frac{\theta^2R^2}{2t}$ over $\theta \in [0,1]$ to get $\theta^* = \left[\frac{t\Delta_i}{R^2}\right]_0^1$, i.e. $t\Delta_i/R^2$ clipped to the interval $[0,1]$. This gives two regimes:

i) Regime 1: Suppose $t\Delta_i/R^2 \geq 1$. Then $\theta^*=1$ and the bound from Claim 3 gives: $$ \Delta_{i+1} \leq \frac{R^2}{2t} < \frac{R^2}{t} $$ so we exit regime 1 after just 1 step.

ii) Regime 2: Suppose $0\leq t \Delta_i/R^2 < 1$. Then $\theta^*=t\Delta_i/R^2$ and the bound gives: $$ \Delta_{i+1} \leq \Delta_i - \frac{t\Delta_i^2}{2R^2} \implies \: \mbox{ if $\Delta_i>0$ then} \: \frac{\Delta_{i+1}}{\Delta_i} \leq 1 - \frac{t\Delta_i}{2R^2} < 1 $$ So the error cannot increase in this regime, and once we enter regime 2 we never leave it. The error $\Delta_i$ also converges to $0$. To see this, fix $\epsilon>0$. If the error ever drops below $\epsilon$, it stays below $\epsilon$ (the error is nonincreasing). Otherwise $\Delta_i\geq \epsilon$ for all $i$ in regime 2, and the regime 2 bound gives $\Delta_{i+1}/\Delta_i \leq 1 - \frac{t\epsilon}{2R^2}$ for all such $i$, so the error decays geometrically and eventually drops below $\epsilon$, a contradiction. Hence the error eventually drops below $\epsilon$; since $\epsilon>0$ was arbitrary, $\Delta_i \rightarrow 0$.
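To see what the regime 2 recursion buys you, here is a small sketch (my own, with arbitrary $t$ and $R$) that iterates the worst-case bound $\Delta_{i+1} = \Delta_i - \frac{t\Delta_i^2}{2R^2}$; the product $i\,\Delta_i$ stays bounded, consistent with the roughly $O(1/i)$ error decay one expects from recursions of this form:

```python
# Iterate the worst-case regime 2 recursion Delta_{i+1} = Delta_i - t*Delta_i^2/(2R^2)
# and watch i * Delta_i: it stays bounded, i.e. the error decays roughly like 1/i.
t, R = 1.0, 1.0                 # arbitrary constants for the sketch
delta = R ** 2 / (2 * t)        # the error bound just after leaving regime 1
for i in range(1, 10001):
    delta -= t * delta ** 2 / (2 * R ** 2)
    if i % 2000 == 0:
        print(f"i={i:5d}  Delta_i={delta:.6f}  i*Delta_i={i * delta:.3f}")
```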


Here is a counter-example to show that we may not converge to the global min, and we may not even move, when the function $f(x)$ is nonconvex. Define $f(x)$ as a function such that:

i) $f(x)$ has a unique minimum over the interval $(-\infty, 2]$ of $f(0)=0$.

ii) $f(x)$ has a minimum over the interval $(2, \infty)$ of $f(3)=-1$.

Suppose $t=1/2$ and $x_0=0$. We have $x_{i+1} = \arg\min_{x\in\mathbb{R}} \left[f(x) + ||x-x_i||^2\right]$. Suppose $x_i=0$ (this is true for $x_0$). We show that $x_{i+1}=0$ also, so (by induction) we never move, and we never converge to the true global min $f(3)=-1$. Since $x_i=0$, we consider minimizing $f(x) + ||x-0||^2$:

- If $x \in (-\infty, 2]$, then $f(x) + ||x-0||^2 \geq f(0)+0 = 0$.

- If $x \in (2, \infty)$, then $f(x) + ||x-0||^2 \geq -1 + 2^2 = 3 > f(0)$.

It follows that $x_{i+1} = 0$. $\Box$
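For concreteness, one (hypothetical) choice of $f$ satisfying (i) and (ii) is $f(x)=x^2$ for $x\leq 2$ and $f(x)=(x-3)^2-1$ for $x>2$; a crude grid search over the prox objective confirms the iteration never leaves $0$:

```python
import numpy as np

# One concrete f satisfying (i) and (ii):
#   f(x) = x^2            for x <= 2   (unique minimum f(0) = 0 on (-inf, 2])
#   f(x) = (x - 3)^2 - 1  for x >  2   (minimum f(3) = -1 on (2, inf))
# With t = 1/2 the prox objective is f(x) + (x - x_i)^2; a crude grid search
# shows the iteration never moves away from 0, missing the global min at 3.
grid = np.linspace(-6.0, 9.0, 150001)
f_grid = np.where(grid <= 2, grid ** 2, (grid - 3) ** 2 - 1)

x = 0.0
for i in range(5):
    x = grid[np.argmin(f_grid + (grid - x) ** 2)]
    print(f"iteration {i}: x = {x:.4f}")   # stays (numerically) at 0
```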

Another counter-example is if $f(x)=-(1/2)x^2$ and $t = 1/2$. Then if $x_0=0$ we get $x_1 = \arg\min_{x\in\mathbb{R}}[ f(x) + ||x-0||^2] = \arg\min_{x \in \mathbb{R}} \frac{x^2}{2} = 0$. So we stay at the global maximum of $f(x)$ (the infimum of $f(x)$ is $-\infty$).
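In closed form (my own check): with $t=1/2$ the prox objective is $-\tfrac{x^2}{2} + (x-z)^2 = \tfrac{x^2}{2} - 2xz + z^2$, which is strictly convex with minimizer $x=2z$, so the iteration reduces to $x_{i+1}=2x_i$ and never moves from $x_0=0$ (while from any nonzero start it doubles each step and diverges):

```python
# For f(x) = -x^2/2 and t = 1/2, the prox objective -x^2/2 + (x - z)^2
# = x^2/2 - 2*x*z + z^2 is strictly convex with minimizer x = 2*z, so the
# iteration reduces to x_{i+1} = 2*x_i; started at 0 it never moves.
x = 0.0
for i in range(5):
    x = 2 * x
    print(f"iteration {i}: x = {x}")   # stuck at the global maximum x = 0
```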
