Steepest-descent optimization procedure with step size given by harmonic sequence

Tags: convergence-divergence, convex optimization, gradient descent, numerical optimization, optimization

Here is a minimization procedure I've "dreamed up." I'm hoping to gain a better understanding of its mathematical properties and practical efficiency.

Given a (locally) convex function $f:{\mathbb{R}}^n \to \mathbb{R}$, an initial point $x_1$, an initial step size $a_1$, and a tolerance $\delta$, set $k = 1$ and repeat:

  1. If $\lVert\nabla f(x_k )\rVert<\delta$, return $x_k$; otherwise:
  2. Pick step direction $d_k \equiv -\nabla f(x_k )/\lVert\nabla f(x_k )\rVert$.
  3. Take the current step size $a_k$ (with $a_1$ given).
  4. Let $x_{k+1} \equiv x_k +a_k d_k$.
  5. Let $a_{k+1} \equiv a_1 /(k+1)$, so that $a_k = a_1 /k$.
  6. Let $k\equiv k+1$ and return to step 1.
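In MATLAB, the idealized loop might look like the following minimal sketch; it assumes the gradient is available analytically through a handle grad_f (the function name harmonic_descent and the demo problem are illustrative choices, not part of the procedure above):

function x = harmonic_descent(grad_f, x1, a1, delta)
    % Minimal sketch of steps 1-6, assuming an analytic gradient handle.
    x = x1(:);
    k = 1;
    g = grad_f(x);
    while norm(g) >= delta        % step 1: stopping test
        d = -g / norm(g);         % step 2: normalized descent direction
        x = x + (a1 / k) * d;     % steps 3-4 with a_k = a1/k
        k = k + 1;                % steps 5-6: next step size, repeat
        g = grad_f(x);
    end
end

For example, harmonic_descent(@(x) 2*x, 3, 1, 1e-3) minimizes $f(x)=x^2$: it overshoots the origin repeatedly but settles down as the step sizes shrink.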

Most optimization procedures require some kind of line search after picking the step direction, but this algorithm avoids that computation by simply choosing an arbitrary $a_1$ and letting the step size decrease as the algorithm iterates. Since

$$a_k =\frac{a_1 }{k}$$

the step size approaches $0$ in the limit $k\to \infty$, so consecutive iterates eventually move by arbitrarily little (though a vanishing step size alone does not guarantee that the sequence of iterates $\left\{ x_k \right\}$ converges). On the other hand, since the sum

$$\sum_{k=1}^{\infty } a_k =a_1 \sum_{k=1}^{\infty } \frac{1}{k}$$

is divergent, the cumulative sum of the step sizes is infinite, so, assuming convexity, the iterates can never get "stuck" at an $x$ far from $x^*$. (I am unsure how to prove this formally.)
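To see why the step-size budget is unlimited, a standard bound on the harmonic partial sums gives

$$\sum_{k=1}^{K} a_k = a_1 \sum_{k=1}^{K} \frac{1}{k} \ge a_1 \ln (K+1) \to \infty \quad\text{as } K \to \infty,$$

so the total distance the iterates are allowed to travel from $x_1$ grows without bound.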

The above properties also apply to a more general algorithm where, in step 5, we let $a_{k+1} \equiv a_1 /(k+1)^t$ with $t\in (0,1]$, so that $a_k = a_1 /k^t$.
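This is the $p$-series test: for every $t \in (0,1]$ the step sizes still vanish while their sum diverges,

$$a_k = \frac{a_1}{k^t} \longrightarrow 0 \qquad\text{and}\qquad \sum_{k=1}^{\infty } \frac{a_1}{k^t} = \infty .$$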

Is there a name for this optimization procedure? What are its convergence properties? How should one select the initial values $x_1$ and $a_1$ in the general case?

Here is a proof-of-concept implementation in MATLAB. Since the gradient has to be computed numerically, it is evaluated over a "neighborhood" of size Nsize around $x_k$. Nsize is initialized to $0.01$ and is divided by the iteration counter at each step, which prevents cycling.

% Run the minimizer from (x0, y0) = (-1.34, 1.79) with a1 = 1 and Nsize = 0.01.
[x, y] = minimize2d(@obj, -1.34, 1.79, 1, 0.01, 10e-15);
x_star = x(end)
y_star = y(end)
f_star = obj(x_star, y_star)

% Contour plot of the objective with the path of iterates overlaid.
[x_plot, y_plot] = meshgrid(linspace(-1.6, 0.3, 51), linspace(.9, 1.9, 51));
z_plot = obj(x_plot, y_plot);
contour(x_plot, y_plot, z_plot, 10)
hold on
plot(x, y, "-k")
scatter(x_star, y_star)
hold off

function f = obj(x, y)
    % Convex test objective: 4x^2 + exp(1.5y) + exp(-y) - 10y.
    f = 4*x.^2 + exp(1.5*y) + exp(-y) - 10*y;
end

function [x, y] = minimize2d(fun, x0, y0, a0, Nsize, tol)
    x = x0; y = y0; a = a0;

    grad_magnitude = tol + 1;
    i = 1;

    while grad_magnitude > tol
        a = a0 / i;          % harmonic step size: a_k = a0/k
        Nsize = Nsize / i;   % shrink the finite-difference neighborhood each iteration
        % Evaluate the objective on a 3x3 grid centered on the current iterate.
        [xN, yN] = meshgrid(linspace(x(i)-Nsize, x(i)+Nsize, 3), ...
            linspace(y(i)-Nsize, y(i)+Nsize, 3));
        f = fun(xN, yN);
        % gradient is called without a spacing argument, so px and py are the
        % partial derivatives scaled by the grid spacing Nsize; because Nsize
        % shrinks so fast, this scaling eventually drives the stopping test.
        [px, py] = gradient(f);
        % Centered-difference estimate at the middle point (2,2) of the grid.
        grad_magnitude = norm([px(2,2) py(2,2)]);
        step = -a * [px(2,2), py(2,2)] / grad_magnitude;
        x(i+1) = x(i) + step(1);
        y(i+1) = y(i) + step(2);
        i = i + 1;
    end
    nit = i   % no semicolon: displays the iteration count
end

Output:

nit = 16
x_star = -7.5968e-06
y_star = 1.2651
f_star = -5.6986


Best Answer

Upon finishing my answer, I realized that I had misread your "step 2." What I write below is for a version of the algorithm where $d_k = -\nabla f(x_k)$, so that the magnitude of the gradient affects the actual step; I will still refer to $a_k$ as the "step size." I understand this is a bit different from the algorithm you have written, but I hope the answer is still helpful anyway.


This is essentially gradient descent where you have chosen a specific sequence of step sizes. Your "step 1" is a stopping criterion in place of "stop when $\nabla f(x_k)= 0$" to account for numerical imprecision.
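For concreteness, here is a minimal MATLAB sketch of the variant analyzed in this answer, where the raw (unnormalized) gradient multiplies the decreasing step size; the quadratic test objective and the constants are my own illustrative choices:

% Gradient descent with d_k = -grad f(x_k) and the harmonic schedule,
% on the toy objective f(x) = x'*x (minimized at the origin).
grad_f = @(x) 2*x;                 % gradient of f(x) = x'*x
x = [2; -1];                       % starting point
a1 = 0.25;                         % initial step size
for k = 1:200
    x = x - (a1 / k) * grad_f(x);  % a_k = a1/k scales the raw gradient
end
disp(norm(x))                      % remaining distance to the minimizer

Note how, unlike in the normalized version, a large gradient still produces a large move even when $a_k$ is small.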

There are many resources discussing the properties of gradient descent, such as university course notes and standard texts. There you can find convergence results that depend on your assumptions on $f$. In some cases, a constant step size can get you an $O(1/\sqrt{k})$ error rate, while in special circumstances a decreasing step size can guarantee a faster $O(1/k)$ error rate. I am being purposefully vague here because you need to introduce various technical notions to state these results precisely.

Finally, your observation about the sum of the step sizes diverging is something that Robbins and Monro observed for stochastic methods. In that context, the intuition is that the divergence condition $\sum_k a_k = \infty$ ensures that you have enough "gas" to explore the space, while the convergence condition $\sum_k a_k^2 < \infty$ ensures that your steps decrease fast enough that you can home in on the solution instead of jumping wildly all over the place. Again, this is in the context of stochastic methods; I am not sure this intuition holds for non-stochastic methods like gradient descent.
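For the question's generalized schedule $a_k = a_1/k^t$, both conditions reduce to standard $p$-series facts:

$$\sum_{k=1}^{\infty} \frac{a_1}{k^{t}} = \infty \iff t \le 1, \qquad \sum_{k=1}^{\infty} \frac{a_1^2}{k^{2t}} < \infty \iff t > \tfrac{1}{2},$$

so both Robbins-Monro conditions hold exactly when $t \in (\tfrac{1}{2}, 1]$, which includes the harmonic choice $t = 1$.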
