My textbook, Deep Learning by Goodfellow, Bengio, and Courville, says the following in a section on numerical computation:
Newton's method is based on using a second-order Taylor series expansion to approximate $f(\mathbf{x})$ near some point $\mathbf{x}^{(0)}$:
$$f(\mathbf{x}) \approx f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^T \nabla_{\mathbf{x}}f(\mathbf{x}^{(0)}) + \dfrac{1}{2}(\mathbf{x} - \mathbf{x}^{(0)})^T \mathbf{H}(f)(\mathbf{x}^{(0)})(\mathbf{x} - \mathbf{x}^{(0)})$$
If we then solve for the critical point of this function, we obtain
$$\mathbf{x}^* = \mathbf{x}^{(0)} - \mathbf{H}(f)(\mathbf{x}^{(0)})^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)})$$
Here $\mathbf{H}(f)(\mathbf{x}^{(0)})$ is the Hessian of $f$ evaluated at $\mathbf{x}^{(0)}$.
I was wondering if someone could please take the time to show how it is that we obtain
$$\mathbf{x}^* = \mathbf{x}^{(0)} - \mathbf{H}(f)(\mathbf{x}^{(0)})^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)})$$
if we solve for the critical point.
Best Answer
$\newcommand\bx{\mathbf{x}}\newcommand\bH{\mathbf{H}}$ For a stationary point you need to solve the system $$\nabla_\bx f(\bx^*)=\mathbf{0}.$$
Apply this to the quadratic model above: its gradient with respect to $\bx$ is
$$\nabla_\bx f(\bx)\approx \nabla_\bx f(\bx^{(0)}) + \bH_f(\bx^{(0)})\,(\bx-\bx^{(0)}),$$
since the linear term differentiates to the constant gradient and the quadratic form differentiates to the Hessian times the displacement. Setting this to zero and solving for $\bx$ gives
$$\bx^* = \bx^{(0)} - \bH_f(\bx^{(0)})^{-1}\,\nabla_\bx f(\bx^{(0)}).$$
More generally, $(\nabla_\bx f)'(\bx)=\bH_f(\bx)$ is the Hessian matrix of $f$ (also written as $\nabla_\bx^2 f$ or simply $f''$), so the Newton step iterating toward the critical point reads
$$ \bx_{\text{new}}=\bx-(\nabla_\bx f)'(\bx)^{-1}\nabla_\bx f(\bx)=\bx-\bH_f(\bx)^{-1}\nabla_\bx f(\bx). $$
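As a quick numerical sanity check (my own sketch, not from the book, using NumPy): when $f$ is exactly quadratic, the second-order Taylor model is exact, so a single Newton step from any starting point should land directly on the minimizer. The matrix `A`, vector `b`, and starting point below are arbitrary illustrative choices.

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite.
# Its gradient is A x - b and its Hessian is the constant matrix A,
# so the unique minimizer is x* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

def hess(x):
    return A

x0 = np.array([5.0, -7.0])  # arbitrary starting point

# One Newton step: x_new = x0 - H(x0)^{-1} grad(x0).
# Solving the linear system is preferred over forming the explicit inverse.
step = np.linalg.solve(hess(x0), grad(x0))
x_star = x0 - step

print(np.allclose(x_star, np.linalg.solve(A, b)))  # True: one step suffices
```

For a non-quadratic $f$ the Taylor model is only an approximation, so the same step is iterated until the gradient is sufficiently small.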