I'm studying Newton's method and I understand the one-dimensional case perfectly, but the multidimensional version raises some questions for me.
On Wikipedia, Newton's method in higher dimensions is defined as:
$$\textbf{x}_{n+1} = \textbf{x}_n - [Hf(\textbf{x}_n)]^{-1}\nabla f(\textbf{x}_n), \;\;\; n \geq 0.$$
Here $\textbf{x}_n$ is the $p$-dimensional vector at the $n$th iteration, $[Hf(\textbf{x}_n)]^{-1}$ is the inverse of the Hessian matrix of the function $f(\textbf{x})$ at $\textbf{x}_n$, and $\nabla f(\textbf{x}_n)$ is the gradient of $f(\textbf{x})$ at $\textbf{x}_n$. That is:
$$\left( \begin{array}{c}
x_1^{(n+1)} \\
x_2^{(n+1)} \\
\vdots \\
x_p^{(n+1)} \end{array} \right) = \left( \begin{array}{c}
x_1^{(n)} \\
x_2^{(n)} \\
\vdots \\
x_p^{(n)} \end{array} \right) - \left( \begin{array}{cccc}
\frac{\partial^2 f}{\partial x_1^2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2\partial x_1}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p\partial x_1}(\textbf{x}_n)\\
\frac{\partial^2 f}{\partial x_1\partial x_2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2^2}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p\partial x_2}(\textbf{x}_n)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_1\partial x_p}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2\partial x_p}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p^2}(\textbf{x}_n) \end{array} \right)^{-1}\left( \begin{array}{c}
\frac{\partial f}{\partial x_1}(\textbf{x}_n) \\
\frac{\partial f}{\partial x_2}(\textbf{x}_n) \\
\vdots \\
\frac{\partial f}{\partial x_p}(\textbf{x}_n) \end{array} \right)$$
Now my question is: what is the intuition behind this formula? It somewhat resembles the gradient descent algorithm, but the inverse of the Hessian seems to come out of a magician's hat :S Can somebody give me a proof similar to the one given here for the one-dimensional case:
Why does Newton's method work?
Why the Hessian? Why its inverse?! 🙂 What is the intuition behind the formula?
Thank you for any help 🙂 P.S. Here is the page where I got the formula above:
http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization#Higher_dimensions
Note also that in my notation the superscript on the $x_i$'s doesn't denote an exponent; it's just an iteration label.
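To make the formula concrete, here is a minimal numpy sketch (my own addition, not from the linked page). It runs the update on the two-variable example function $f(x,y) = x^4 + y^4 + xy$, with the gradient and Hessian hand-coded; instead of forming the inverse explicitly, it solves the linear system $Hf(\textbf{x}_n)\,\Delta\textbf{x} = -\nabla f(\textbf{x}_n)$, which is numerically equivalent and preferable.

```python
import numpy as np

# Example function (my choice, for illustration): f(x, y) = x^4 + y^4 + x*y
def grad(v):
    x, y = v
    return np.array([4 * x**3 + y, 4 * y**3 + x])

def hess(v):
    x, y = v
    return np.array([[12 * x**2, 1.0],
                     [1.0, 12 * y**2]])

x = np.array([1.0, -0.5])  # starting point x_0
for _ in range(20):
    # Newton step: solve Hf(x_n) Δx = -∇f(x_n) rather than inverting Hf
    step = np.linalg.solve(hess(x), -grad(x))
    x = x + step
    if np.linalg.norm(step) < 1e-12:
        break

print(x, np.linalg.norm(grad(x)))  # gradient norm should be essentially 0
```

From this starting point the iteration settles on the critical point $(0.5, -0.5)$, where the gradient vanishes, after only a handful of steps.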
Best Answer
I'll assume we're trying to minimize a twice continuously differentiable function $f$ defined on $\mathbb R^p$.
We wish to find $x$ such that $\nabla f(x) = 0$.
Given $x_n$, we would ideally like to find $\Delta x$ such that $\nabla f(x_n + \Delta x) = 0$. Rather than satisfying this requirement exactly (which would probably be too difficult), we instead use the first-order Taylor approximation of the gradient around $x_n$: \begin{equation*} \nabla f(x_n + \Delta x) \approx \nabla f(x_n) + Hf(x_n) \Delta x. \end{equation*} Setting the right-hand side equal to $0$ gives us \begin{equation*} \Delta x = -Hf(x_n)^{-1} \nabla f(x_n). \end{equation*} We can hope that $x_{n+1} = x_n + \Delta x$ will be an improvement on $x_n$.
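A quick sanity check (my addition, not part of the original answer): when $f$ is an exact quadratic, the approximation above is exact, so Newton's method finds the critical point in a single step. Take \begin{equation*} f(x) = \tfrac{1}{2} x^\top A x - b^\top x \end{equation*} with $A$ symmetric positive definite. Then $\nabla f(x) = Ax - b$ and $Hf(x) = A$ everywhere, so from any starting point $x_0$ \begin{equation*} x_1 = x_0 - A^{-1}(Ax_0 - b) = A^{-1}b, \end{equation*} which is exactly the point where $\nabla f$ vanishes. For a general $f$, the quadratic model is only locally accurate, which is why the step must be iterated.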