Steepest-descent optimization procedure with step size given by harmonic sequence

Tags: convergence-divergence, convex optimization, gradient descent, numerical optimization, optimization

Here is a minimization procedure I've "dreamed up." I'm hoping to gain a better understanding of its mathematical properties and practical efficiency.

Given a (locally) convex function $f:{\mathbb{R}}^n \to \mathbb{R}$, an initial point $x_1$, an initial step size $a_1$, and a tolerance $\delta$, set $k = 1$ and repeat:

  1. If $\lVert\nabla f(x_k )\rVert<\delta$, return $x_k$; otherwise:
  2. Pick step direction $d_k \equiv -\nabla f(x_k )/\lVert\nabla f(x_k )\rVert$.
  3. Take the current step size $a_k$ (with $a_1$ given).
  4. Let $x_{k+1} \equiv x_k +a_k d_k$.
  5. Let $a_{k+1} \equiv a_1 /(k+1)$, so that $a_k = a_1 /k$.
  6. Let $k\equiv k+1$ and return to step 1.
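In MATLAB, the idealized loop might look like the following minimal sketch; it assumes the gradient is available analytically through a handle grad_f (the function name harmonic_descent and the demo problem are illustrative choices, not part of the procedure above):

function x = harmonic_descent(grad_f, x1, a1, delta)
    % Minimal sketch of steps 1-6, assuming an analytic gradient handle.
    x = x1(:);
    k = 1;
    g = grad_f(x);
    while norm(g) >= delta        % step 1: stopping test
        d = -g / norm(g);         % step 2: normalized descent direction
        x = x + (a1 / k) * d;     % steps 3-4 with a_k = a1/k
        k = k + 1;                % steps 5-6: next step size, repeat
        g = grad_f(x);
    end
end

For example, harmonic_descent(@(x) 2*x, 3, 1, 1e-3) minimizes $f(x)=x^2$: it overshoots the origin repeatedly but settles down as the step sizes shrink.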

Most optimization procedures require some kind of line search after picking the step direction, but this algorithm avoids that computation by simply choosing an arbitrary $a_1$ and letting the step size decrease as the algorithm iterates. Since

$$a_k =\frac{a_1 }{k}$$

the step size approaches $0$ in the limit $k\to \infty$, so consecutive iterates eventually move by arbitrarily little (though a vanishing step size alone does not guarantee that the sequence of iterates $\left\{ x_k \right\}$ converges). On the other hand, since the sum

$$\sum_{k=1}^{\infty } a_k =a_1 \sum_{k=1}^{\infty } \frac{1}{k}$$

is divergent, the cumulative sum of the step sizes is infinite, so, assuming convexity, the iterates can never get "stuck" at an $x$ far from $x^*$. (I am unsure how to prove this formally.)
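To see why the step-size budget is unlimited, a standard bound on the harmonic partial sums gives

$$\sum_{k=1}^{K} a_k = a_1 \sum_{k=1}^{K} \frac{1}{k} \ge a_1 \ln (K+1) \to \infty \quad\text{as } K \to \infty,$$

so the total distance the iterates are allowed to travel from $x_1$ grows without bound.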

The above properties also apply to a more general algorithm where, in step 5, we let $a_{k+1} \equiv a_1 /(k+1)^t$ with $t\in (0,1]$, so that $a_k = a_1 /k^t$.
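This is the $p$-series test: for every $t \in (0,1]$ the step sizes still vanish while their sum diverges,

$$a_k = \frac{a_1}{k^t} \longrightarrow 0 \qquad\text{and}\qquad \sum_{k=1}^{\infty } \frac{a_1}{k^t} = \infty .$$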

Is there a name for this optimization procedure? What are its convergence properties? How should one select the initial values $x_1$ and $a_1$ in the general case?

Here is a proof-of-concept implementation in MATLAB. Since the gradient has to be computed numerically, it is evaluated over a "neighborhood" of size Nsize around $x_k$. Nsize is initialized to $0.01$ and is divided by the iteration counter at each step, which prevents cycling.

% Run the minimizer from (x0, y0) = (-1.34, 1.79) with a1 = 1 and Nsize = 0.01.
[x, y] = minimize2d(@obj, -1.34, 1.79, 1, 0.01, 10e-15);
x_star = x(end)
y_star = y(end)
f_star = obj(x_star, y_star)

% Contour plot of the objective with the path of iterates overlaid.
[x_plot, y_plot] = meshgrid(linspace(-1.6, 0.3, 51), linspace(.9, 1.9, 51));
z_plot = obj(x_plot, y_plot);
contour(x_plot, y_plot, z_plot, 10)
hold on
plot(x, y, "-k")
scatter(x_star, y_star)
hold off

function f = obj(x, y)
    % Convex test objective: 4x^2 + exp(1.5y) + exp(-y) - 10y.
    f = 4*x.^2 + exp(1.5*y) + exp(-y) - 10*y;
end

function [x, y] = minimize2d(fun, x0, y0, a0, Nsize, tol)
    x = x0; y = y0; a = a0;

    grad_magnitude = tol + 1;
    i = 1;

    while grad_magnitude > tol
        a = a0 / i;          % harmonic step size: a_k = a0/k
        Nsize = Nsize / i;   % shrink the finite-difference neighborhood each iteration
        % Evaluate the objective on a 3x3 grid centered on the current iterate.
        [xN, yN] = meshgrid(linspace(x(i)-Nsize, x(i)+Nsize, 3), ...
            linspace(y(i)-Nsize, y(i)+Nsize, 3));
        f = fun(xN, yN);
        % gradient is called without a spacing argument, so px and py are the
        % partial derivatives scaled by the grid spacing Nsize; because Nsize
        % shrinks so fast, this scaling eventually drives the stopping test.
        [px, py] = gradient(f);
        % Centered-difference estimate at the middle point (2,2) of the grid.
        grad_magnitude = norm([px(2,2) py(2,2)]);
        step = -a * [px(2,2), py(2,2)] / grad_magnitude;
        x(i+1) = x(i) + step(1);
        y(i+1) = y(i) + step(2);
        i = i + 1;
    end
    nit = i   % no semicolon: displays the iteration count
end

Output:

nit = 16
x_star = -7.5968e-06
y_star = 1.2651
f_star = -5.6986


Best Answer

Upon finishing my answer, I realized that I had misread your "step 2." What I write below is for a version of the algorithm where $d_k = -\nabla f(x_k)$, so that the magnitude of the gradient affects the actual step; I will still refer to $a_k$ as the "step size." I understand this is a bit different from the algorithm you have written, but I hope the answer is still helpful anyway.


This is essentially gradient descent where you have chosen a specific sequence of step sizes. Your "step 1" is a stopping criterion in place of "stop when $\nabla f(x_k)= 0$" to account for numerical imprecision.
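For concreteness, here is a minimal MATLAB sketch of the variant analyzed in this answer, where the raw (unnormalized) gradient multiplies the decreasing step size; the quadratic test objective and the constants are my own illustrative choices:

% Gradient descent with d_k = -grad f(x_k) and the harmonic schedule,
% on the toy objective f(x) = x'*x (minimized at the origin).
grad_f = @(x) 2*x;                 % gradient of f(x) = x'*x
x = [2; -1];                       % starting point
a1 = 0.25;                         % initial step size
for k = 1:200
    x = x - (a1 / k) * grad_f(x);  % a_k = a1/k scales the raw gradient
end
disp(norm(x))                      % remaining distance to the minimizer

Note how, unlike in the normalized version, a large gradient still produces a large move even when $a_k$ is small.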

There are many resources discussing the properties of gradient descent, such as university course notes and standard texts. There you can find convergence results that depend on your assumptions on $f$. In some cases, a constant step size can get you an $O(1/\sqrt{k})$ error rate, while in special circumstances a decreasing step size can guarantee a faster $O(1/k)$ error rate. I am being purposefully vague here because you need to introduce various technical notions to state these results precisely.

Finally, your observation about the sum of the step sizes diverging is something that Robbins and Monro observed for stochastic methods. In that context, the intuition is that the divergence condition $\sum_k a_k = \infty$ ensures that you have enough "gas" to explore the space, while the convergence condition $\sum_k a_k^2 < \infty$ ensures that your steps decrease fast enough that you can home in on the solution instead of jumping wildly all over the place. Again, this is in the context of stochastic methods; I am not sure this intuition holds for non-stochastic methods like gradient descent.
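For the question's generalized schedule $a_k = a_1/k^t$, both conditions reduce to standard $p$-series facts:

$$\sum_{k=1}^{\infty} \frac{a_1}{k^{t}} = \infty \iff t \le 1, \qquad \sum_{k=1}^{\infty} \frac{a_1^2}{k^{2t}} < \infty \iff t > \tfrac{1}{2},$$

so both Robbins-Monro conditions hold exactly when $t \in (\tfrac{1}{2}, 1]$, which includes the harmonic choice $t = 1$.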
