[Math] Newton’s method in higher dimensions explained

calculus, linear algebra, multivariable-calculus, optimization, partial derivative

I'm studying Newton's method, and I understand the one-dimensional case perfectly, but the multidimensional version raises some questions…

On Wikipedia, Newton's method in higher dimensions is defined as:

$$\textbf{x}_{n+1} = \textbf{x}_n - [Hf(\textbf{x}_n)]^{-1}\nabla f(\textbf{x}_n), \;\;\; n \geq 0.$$

Here $\textbf{x}_n$ is the $p$-dimensional iterate at the $n$th iteration, $[Hf(\textbf{x}_n)]^{-1}$ is the inverse of the Hessian matrix of the function $f(\textbf{x})$ at $\textbf{x}_n$, and $\nabla f(\textbf{x}_n)$ is the gradient of the function $f(\textbf{x})$ at $\textbf{x}_n$. That is:

$$\left( \begin{array}{c}
x_1^{(n+1)} \\
x_2^{(n+1)} \\
\vdots \\
x_p^{(n+1)} \end{array} \right) = \left( \begin{array}{c}
x_1^{(n)} \\
x_2^{(n)} \\
\vdots \\
x_p^{(n)} \end{array} \right) - \left( \begin{array}{cccc}
\frac{\partial^2 f}{\partial x_1^2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2 \partial x_1}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p \partial x_1}(\textbf{x}_n)\\
\frac{\partial^2 f}{\partial x_1 \partial x_2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2^2}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p \partial x_2}(\textbf{x}_n)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_1 \partial x_p}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2 \partial x_p}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p^2}(\textbf{x}_n) \end{array} \right)^{-1}\left( \begin{array}{c}
\frac{\partial f}{\partial x_1}(\textbf{x}_n) \\
\frac{\partial f}{\partial x_2}(\textbf{x}_n) \\
\vdots \\
\frac{\partial f}{\partial x_p}(\textbf{x}_n) \end{array} \right)$$
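
Just to check that I'm reading the notation correctly, here is my own literal translation of a single update step into NumPy (the quadratic test function, its gradient, and its Hessian are made up by me for illustration; they are not from the Wikipedia page):

```python
import numpy as np

# Made-up quadratic test function: f(x) = 0.5 x^T A x - b^T x with a fixed
# symmetric positive definite A, so grad f(x) = A x - b and Hf(x) = A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x_n = np.array([4.0, 4.0])        # current iterate x_n
grad_f = A @ x_n - b              # gradient of f at x_n
hess_f = A                        # Hessian of f at x_n

# x_{n+1} = x_n - [Hf(x_n)]^{-1} grad f(x_n), computed via a linear solve
x_next = x_n - np.linalg.solve(hess_f, grad_f)
print(x_next)                     # the next iterate x_{n+1}
```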

Now my question is: what is the intuition behind this formula? It somehow resembles the gradient descent algorithm, but the inverse of the Hessian seems to come straight out of a magician's hat :S Can somebody give me a proof similar to the one given here for the one-dimensional case:

Why does Newton's method work?

Why the Hessian? Why its inverse?! 🙂 What is the intuition behind the formula?

Thank you for any help 🙂 P.S. Here is the page where I got the formula above:

http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization#Higher_dimensions

Note also that in my notation the superscript on the $x_i$s doesn't denote an exponent; it's just an iteration label…

Best Answer

I'll assume we're trying to minimize a twice continuously differentiable function $f$ defined on $\mathbb R^p$.

We wish to find $x$ such that $\nabla f(x) = 0$.

Given $x_n$, we would ideally like to find $\Delta x$ such that $\nabla f(x_n + \Delta x) = 0$. Rather than satisfying this requirement exactly (which would probably be too difficult), we instead use the first-order Taylor approximation of the gradient about $x_n$, \begin{equation*} \nabla f(x_n + \Delta x) \approx \nabla f(x_n) + Hf(x_n) \Delta x. \end{equation*} Setting the right-hand side equal to $0$ gives us \begin{equation*} \Delta x = -Hf(x_n)^{-1} \nabla f(x_n). \end{equation*} We can hope that $x_{n+1} = x_n + \Delta x$ will be an improvement on $x_n$.
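
To see the iteration doing exactly what this derivation says, here is a minimal sketch in NumPy. The test function $f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 3)^4$ and its derivatives are my own choice for illustration; the only point is that repeatedly solving $Hf(x_n)\,\Delta x = -\nabla f(x_n)$ and stepping drives the gradient toward zero.

```python
import numpy as np

# Hypothetical test function f(x1, x2) = (x1 - 1)^2 + (x2 + 3)^4,
# chosen only so the gradient and Hessian are easy to write down.
def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0) ** 3])

def hess(x):
    return np.array([[2.0, 0.0],
                     [0.0, 12.0 * (x[1] + 3.0) ** 2]])

x = np.array([5.0, 5.0])                    # starting point x_0
for n in range(50):
    # Solve Hf(x_n) delta = -grad f(x_n) instead of forming the inverse.
    delta = np.linalg.solve(hess(x), -grad(x))
    x = x + delta                           # x_{n+1} = x_n + delta
    if np.linalg.norm(grad(x)) < 1e-10:     # stop once the gradient is tiny
        break

print(n, x)   # x approaches the minimizer (1, -3) as the gradient vanishes
```

Note that the code solves the linear system rather than forming the inverse explicitly, which is how the update is usually implemented in practice. Also, if $f$ happens to be a strictly convex quadratic, the approximation above is exact, so a single Newton step lands directly on the minimizer.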