Global Newton Method Algorithm for Optimisation


In our optimization script there is an algorithm called the Global Newton Method with Armijo stepsize.
It is an algorithm for solving an unconstrained minimisation problem for an objective function $f: \mathbb{R}^n \to \mathbb{R}$.

While I have a decent understanding of how gradient descent algorithms generally work, I don't understand the descent step in this algorithm (I have tried to unpack it for hours). Searching for the Newton method has so far only pointed me to the algorithm used to find the zeros of a function.

Let $S := \{2^{-i} \mid i \in \mathbb{N}\}$; then the algorithm is as follows (a rough Python sketch is given after the listing).

Algorithm:

  1. (Initialisation): $\quad$ Choose $x_0 \in \mathbb{R}^n$, $\delta \in (0,1)$,
    $p > 0$, $q > 2$, and $k:=0$.
  2. (Test): $\quad$ If $\nabla f(x_k) = 0$, stop.
  3. (Descent): $\quad$ Set $H_k := \nabla^2f(x_k)$ and calculate if possible $d_k$
    as the solution of
    $$H_k d = -\nabla f(x_k)$$
    If this system either does not have a solution or the condition
    ${\nabla f(x_k)}^\top d_k \leq -p \Vert d_k \Vert^q$
    is not fulfilled, then set $d_k := -\nabla f(x_k)$.
  4. (Stepsize): $\quad$ $\alpha_k := \max \{\alpha \in S \mid f(x_k + \alpha d_k) \leq f(x_k) + \delta \alpha {\nabla f(x_k)}^\top d_k \} $
  5. (Update): $\quad$ Set $x_{k+1} := x_k + \alpha_k d_k$, set $k := k+1$, and return to Step 2.
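
To make the steps concrete for myself, here is a minimal Python sketch of the algorithm (my own rough translation, assuming NumPy and user-supplied callables `f`, `grad` and `hess`; the tolerance, the iteration cap and the default parameter values are my additions, and I take $\mathbb{N}$ to include $0$ so that $S$ starts at $2^0 = 1$):

```python
import numpy as np

def global_newton_armijo(f, grad, hess, x0, delta=0.5, p=1e-6, q=3.0,
                         tol=1e-10, max_iter=200):
    """Rough sketch of the algorithm above: f, grad and hess are callables
    returning f(x), the gradient and the Hessian at x; delta, p, q are the
    parameters from Step 1; tol and max_iter replace the exact test
    grad f(x_k) = 0, which is never hit in floating point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        # Step 2 (Test).
        if np.linalg.norm(g) <= tol:
            break
        # Step 3 (Descent): try the Newton direction H_k d = -grad f(x_k);
        # fall back to -grad f(x_k) if the system has no solution or the
        # direction fails the p, q descent condition.
        try:
            d = np.linalg.solve(hess(x), -g)
            if g @ d > -p * np.linalg.norm(d) ** q:
                d = -g
        except np.linalg.LinAlgError:
            d = -g
        # Step 4 (Stepsize): largest alpha in S = {2^{-i}} satisfying the
        # Armijo condition, found by halving downward from alpha = 1.
        alpha = 1.0
        while f(x + alpha * d) > f(x) + delta * alpha * (g @ d):
            alpha *= 0.5
        # Step 5 (Update).
        x = x + alpha * d
    return x
```

On a well-behaved test function the Newton direction is usually accepted, and the Armijo loop only shrinks $\alpha$ when a full step would overshoot.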

What does it mean intuitively to multiply the Hessian of a function with a vector?
Why the parameters $p$ and $q$?
Or generally: Is there a graphical interpretation of what we are doing?

I realise these are 'big' questions but any help would be appreciated.

Edit:
Thanks to Ian, I could make some sense of
$$H_k d = -\nabla f(x_k)$$

I'll expand for myself and the convenience of the reader:

If we approximate $f$ quadratically around $x_k$ like so:

$$f(x) \approx f(x_k) + {\nabla f(x_k)}^\top (x - x_k) + \frac{1}{2} {(x - x_k)}^\top \nabla^2 f(x_k) (x - x_k)$$

Then if $\nabla^2 f(x_k)$ is positive definite, the minimum of this quadratic approximation is attained at
$$
x = x_k - {\nabla^2 f(x_k)}^{-1} \nabla f(x_k).
$$

Now multiply both sides by $\nabla^2 f(x_k)$ from the left
$$
\nabla^2 f(x_k) x = \nabla^2 f(x_k) x_k - \nabla f(x_k)
$$

By subtracting $\nabla^2 f(x_k) x_k$ on both sides and applying the distributive law we get
$$
\nabla^2 f(x_k) (x - x_k) = -\nabla f(x_k).
$$

Replace $\nabla^2 f(x_k)$ with $H_k$ and $(x - x_k)$ with $d$ and we get $H_k d = -\nabla f(x_k)$. So $d_k$ is chosen such that the step $x_k + d_k$ minimizes the quadratic approximation of $f$ around $x_k$.
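
As a sanity check of this, here is a throwaway NumPy snippet (with numbers I made up for $\nabla f(x_k)$ and $H_k$, not anything from our script): the solution of $H_k d = -\nabla f(x_k)$ is exactly the step at which the quadratic model has zero gradient, and perturbing that step only increases the model value when $H_k$ is positive definite.

```python
import numpy as np

# Made-up stand-ins for grad f(x_k) and H_k (H symmetric positive definite).
g = np.array([1.0, -2.0])
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Newton step: solve H d = -g rather than forming the inverse explicitly.
d_newton = np.linalg.solve(H, -g)

# Quadratic model of f around x_k as a function of the step d
# (the constant f(x_k) does not affect where the minimum is).
m = lambda d: g @ d + 0.5 * d @ H @ d

# The gradient of the model, g + H d, vanishes at the Newton step ...
assert np.allclose(g + H @ d_newton, 0.0)

# ... and perturbing the step in any direction only increases the model value.
rng = np.random.default_rng(0)
assert all(m(d_newton) <= m(d_newton + 0.1 * rng.standard_normal(2))
           for _ in range(100))
```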

Best Answer

First there's the concept of taking a quadratic approximation of the objective function. This looks like

$$f(x) \approx f(x_0) + \nabla f(x_0) \cdot (x-x_0) + \frac{1}{2} (x-x_0)^T \nabla^2 f(x_0) (x-x_0).$$

If $\nabla^2 f(x_0)$ is positive definite then the minimum of this function is located at $x=x_0-(\nabla^2 f(x_0))^{-1} \nabla f(x_0)$. (This is basically the multivariate version of the $x=-b/(2a)$ formula for the vertex of a parabola.) One can in principle hope that this point is at least closer to the minimum of the original $f$ than $x_0$ was.

This is just the Newton method for optimization; in the end it is the Newton method for finding zeros, applied to the function $\nabla f$.
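
To see that in code (a small NumPy illustration with numbers I made up): for a quadratic $f$ the gradient is affine, so one root-finding Newton step on $\nabla f$, which is exactly $x - (\nabla^2 f(x))^{-1} \nabla f(x)$, lands on the true minimizer in a single step.

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and the Hessian is A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

grad = lambda x: A @ x - b   # the function whose zero we are after
hess = lambda x: A           # Jacobian of grad f = Hessian of f

x0 = np.array([10.0, -7.0])  # arbitrary starting point

# One Newton root-finding step on grad f ...
x1 = x0 - np.linalg.solve(hess(x0), grad(x0))

# ... hits the exact minimizer A^{-1} b, because the quadratic model is exact.
assert np.allclose(x1, np.linalg.solve(A, b))
```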

This method does some other things too. First, in Step 3 it decides whether to use the Newton search direction or the gradient descent search direction, based on which one looks like it would be better. The parameters $p$ and $q$ determine the threshold for that decision. I don't have much intuition for why that particular scheme with $p$ and $q$ makes sense.

Having picked a direction, you line search along that direction. You want to go far enough that things have a good chance to improve, but not so far that things aren't really improving at all. This particular method asks that the actual decrease in $f$ be not too much worse than what the directional derivative predicts, where "how much worse is acceptable" is controlled by the parameter $\delta$. Because they impose $\delta \in (0,1)$, this line search will always terminate (albeit perhaps for a problematically small $\alpha$), by the definition of the directional derivative. The original Newton method uses $\alpha=1$, which is often not reasonable if your current guess is not very good.
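
Here is a tiny NumPy experiment (a one-dimensional example I made up, using only the Armijo rule from Step 4) showing the role of $\delta$: a larger $\delta$ demands a decrease closer to the linear prediction and therefore accepts only a smaller $\alpha$.

```python
import numpy as np

def armijo_step(f, x, g, d, delta):
    """Largest alpha in {1, 1/2, 1/4, ...} with
    f(x + alpha d) <= f(x) + delta * alpha * g^T d."""
    alpha = 1.0
    while f(x + alpha * d) > f(x) + delta * alpha * (g @ d):
        alpha *= 0.5
    return alpha

# Steep example f(x) = x^4 with the steepest descent direction d = -f'(x);
# a full step alpha = 1 badly overshoots, so the backtracking has to kick in.
f = lambda x: (x ** 4).sum()
x = np.array([1.5])
g = 4 * x ** 3
d = -g
print(armijo_step(f, x, g, d, delta=0.1))  # tolerant: accepts a larger alpha
print(armijo_step(f, x, g, d, delta=0.9))  # strict: forces a smaller alpha
```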

The last step should be self-explanatory.
