Newton’s method and gradient descent in deep learning

Tags: gradient descent, machine learning, multivariable-calculus, numerical methods, vector analysis

My textbook, Deep Learning by Goodfellow, Bengio, and Courville, says the following in a section on numerical computation:

Newton's method is based on using a second-order Taylor series expansion to approximate $f(\mathbf{x})$ near some point $\mathbf{x}^{(0)}$:

$$f(\mathbf{x}) \approx f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^T \nabla_{\mathbf{x}}f(\mathbf{x}^{(0)}) + \dfrac{1}{2}(\mathbf{x} - \mathbf{x}^{(0)})^T \mathbf{H}(f)(\mathbf{x}^{(0)})(\mathbf{x} - \mathbf{x}^{(0)})$$

If we then solve for the critical point of this function, we obtain

$$\mathbf{x}^* = \mathbf{x}^{(0)} - \mathbf{H}(f)(\mathbf{x}^{(0)})^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}) \tag{4.12}$$

When $f$ is a positive definite quadratic function, Newton's method consists of applying equation 4.12 once to jump to the minimum of the function directly. When $f$ is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying equation 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point. As discussed in section 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points towards them.

After reading this, I have the following questions:

  1. How is it that, when $f$ is a positive definite quadratic function, Newton's method consists of applying equation 4.12 once to jump to the minimum of the function directly? What is the reasoning behind how this works?

  2. How is it that, when $f$ is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying equation 4.12 multiple times? What is the reasoning behind how this works (in the same vein as 1.)?

  3. How is it that iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would?

I would greatly appreciate it if people with deeper knowledge of machine learning could please take the time to clarify this.

Best Answer

  1. When $f$ is quadratic, the second-order approximation (the $f(\mathbf{x}) \approx \cdots$ expression in your post) is actually an equality. The Newton update (4.12) is the exact minimizer of the function on the right-hand side (take the gradient of the right-hand side and set it to zero; the short derivation is written out after this list), so in this case it also minimizes the original function $f$ on the left-hand side, since the approximation holds with equality.

  2. The Newton algorithm is defined as applying (4.12) repeatedly: after each update, the second-order Taylor expansion is recomputed around the new point, and the next update jumps to the minimizer of that new approximation (see the first sketch after this list). There is no guarantee of convergence to a local minimum in general. But intuitively, if you are near a local minimum, the second-order approximation closely resembles the actual function, so the minimizer of the approximation is close to the minimizer of the actual function, and each jump should bring you closer to it.

  3. This isn't guaranteed in general. But under certain conditions one can make rigorous statements about the convergence rates of Newton's method and gradient descent (e.g., locally quadratic versus linear convergence). Intuitively, each Newton step minimizes a second-order approximation, which uses curvature information that gradient descent's first-order approximation ignores, and this can make Newton's method much faster under certain conditions, especially on ill-conditioned problems (the second sketch after this list illustrates this).
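
To spell out the gradient-and-set-to-zero step in answer 1, write $\mathbf{H}$ for $\mathbf{H}(f)(\mathbf{x}^{(0)})$ and assume it is invertible (which positive definiteness guarantees). Taking the gradient of the right-hand side of the quadratic approximation with respect to $\mathbf{x}$ (and using the symmetry of $\mathbf{H}$) gives

$$\nabla_{\mathbf{x}}\left[f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^T \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}) + \dfrac{1}{2}(\mathbf{x} - \mathbf{x}^{(0)})^T \mathbf{H} (\mathbf{x} - \mathbf{x}^{(0)})\right] = \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}) + \mathbf{H}(\mathbf{x} - \mathbf{x}^{(0)})$$

Setting this to $\mathbf{0}$ and solving for $\mathbf{x}$ yields $\mathbf{x}^* = \mathbf{x}^{(0)} - \mathbf{H}^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)})$, which is exactly equation 4.12.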
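
For answer 2, here is a minimal Python sketch of the iteration (my own illustration, not the book's code), assuming you can supply callables `grad(x)` and `hess(x)` that return the gradient and Hessian of $f$ at a point:

```python
import numpy as np

def newtons_method(grad, hess, x0, num_steps=50, tol=1e-10):
    """Repeatedly apply the Newton update (4.12), re-expanding around each new point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        # Solve H @ step = grad instead of forming the inverse explicitly.
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:  # stop once the updates become negligible
            break
    return x

# Example: f(x) = x1^4 + x1^2 + x2^2 is not quadratic, but near its minimum at (0, 0)
# it is well approximated by a positive definite quadratic.
grad = lambda x: np.array([4 * x[0]**3 + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2 + 2, 0.0], [0.0, 2.0]])
print(newtons_method(grad, hess, x0=[1.0, 1.0]))  # converges to (0, 0) in a handful of steps
```

Each pass through the loop is one application of equation 4.12 around the current point; that is all "applying equation 4.12 multiple times" means.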
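
And for answer 3, a small illustration (again my own sketch, using an ill-conditioned quadratic as the assumed objective): on $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}$ with eigenvalues 1 and 100, a single Newton step lands exactly on the minimum, while gradient descent with a stable fixed step size is still far away after 100 iterations:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimized at the origin.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x

# Gradient descent with step size 1 / (largest eigenvalue), a standard stable choice.
x_gd = np.array([1.0, 1.0])
lr = 1.0 / 100.0
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)

# Newton's method: one step is exact because f really is a quadratic.
x_newton = np.array([1.0, 1.0])
x_newton = x_newton - np.linalg.solve(A, grad(x_newton))

print("gradient descent, 100 steps:", x_gd)      # first coordinate is still around 0.37
print("Newton's method, 1 step:   ", x_newton)   # exactly [0. 0.]
```

The point is not that Newton's method always wins, only that using curvature lets it account for the long, flat direction that a fixed-step first-order method crawls along.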