Solved – Why is the second derivative required for Newton's method in back-propagation

backpropagation, gradient descent, machine learning, optimization

I am puzzled about why Newton's method isn't used more widely for backpropagation, instead of, or in addition to, gradient descent.

I have seen this same question, and the widely accepted answer claims

Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative

I went and looked. According to the Wikipedia article on Newton's method:

Geometrically, $(x_1, 0)$ is the intersection of the $x$-axis and the tangent of the graph of $f$ at $(x_0, f(x_0))$. The process is repeated until a sufficiently accurate value is reached.

I really don't get where and why the second derivative should ever be calculated.

I also saw this similar question, and the accepted answer in short was:

the reason is that the cost functions mentioned might not have any zeroes at all, in which case Newton's method will fail to find the minima

This seems closely related to the problem of vanishing gradients in gradient descent, and would probably admit similar solutions, but it still doesn't explain why the second derivative is required.

Please explain why the calculation of the second derivative is needed in order to apply Newton's method to back-propagation.

Best Answer

My guess at your confusion:

Newton's method is often used to solve two different (but related) problems:

  1. Find $x$ such that $f(x) = 0$
  2. Find $x$ to minimize $g(x)$

A connection between the two problems arises when $f = g'$.

If $g$ is continuous and differentiable, a necessary condition for an optimum of the unconstrained minimization problem (2) is that the derivative satisfies $g'(x) = 0$. Let $f(x) = g'(x)$. If the first-order condition $g'(x) = 0$ is also a sufficient condition for an optimum (e.g., $g$ is convex), then (1) and (2) are the same problem.
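As a concrete illustration (my own example, not from the original question), take a convex cost that never reaches zero:

$$g(x) = (x-3)^2 + 1, \qquad g'(x) = 2(x-3), \qquad g''(x) = 2.$$

$g$ itself has no root (its minimum value is $1$), so running Newton's root-finding directly on $g$ cannot succeed; but $f = g'$ has the root $x = 3$, which is exactly the minimizer of $g$. This is also why the "no zeroes" objection quoted in the question doesn't apply once Newton's method is aimed at $g'$ rather than at $g$.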

Applying Newton's method, the update step for problem (1) is: $$ x_{t+1} = x_{t} - \frac{f(x_{t})}{f'(x_{t})}$$ The update step for problem (2) is: $$ x_{t+1} = x_{t} - \frac{g'(x_{t})}{g''(x_{t})}$$

If $f = g'$ then these are exactly the same thing.
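Here is a minimal numerical sketch of that equivalence, using the toy function above (the function and the helper names `newton_root_step` / `newton_min_step` are my own, chosen for illustration):

```python
# Toy example (my choice): minimize g(x) = (x - 3)^2 + 1.
# g has no root (its minimum value is 1), but f := g' has the root x = 3,
# which is exactly the minimizer of g.

def g(x):        return (x - 3.0) ** 2 + 1.0
def g_prime(x):  return 2.0 * (x - 3.0)   # this is f
def g_second(x): return 2.0               # this is f'

def newton_root_step(f, f_prime, x):
    """Update for problem (1): find x such that f(x) = 0."""
    return x - f(x) / f_prime(x)

def newton_min_step(gp, gpp, x):
    """Update for problem (2): minimize g, using g' and g''."""
    return x - gp(x) / gpp(x)

x1 = x2 = 10.0  # identical starting points
for t in range(3):
    x1 = newton_root_step(g_prime, g_second, x1)  # root-finding applied to f = g'
    x2 = newton_min_step(g_prime, g_second, x2)   # minimization applied to g
    print(t, x1, x2)  # the two iterates coincide at every step
```

Since $g$ here is exactly quadratic, the very first Newton step lands on $x = 3$; the point of the loop is only that the two update rules produce the same sequence.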

In the optimization context, the Newton update step can be interpreted as creating a quadratic approximation of $g$ around the point $x_t$. That is, create the step-$t$ quadratic approximation of $g$ as:

$$\hat{g}_t(x) = g(x_t) + g'(x_t)(x - x_t) + \frac{1}{2}g''(x_t)(x - x_t)^2$$ then choose $x_{t+1}$ to minimize $\hat{g}_t$.
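To see that this recovers the same update, set the derivative of $\hat{g}_t$ to zero (assuming $g''(x_t) > 0$, so the quadratic has a minimum):

$$\hat{g}_t'(x) = g'(x_t) + g''(x_t)(x - x_t) = 0 \quad\Longrightarrow\quad x_{t+1} = x_t - \frac{g'(x_t)}{g''(x_t)}.$$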

This is the same update step as approximating $f = g'$ as a linear function $\hat{f}_t (x) = f(x_t) + f'(x_t)(x - x_t)$ and setting $x_{t+1}$ to be the value of $x$ where $\hat{f}_t$ crosses $0$.
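Explicitly, solving $\hat{f}_t(x) = 0$ gives

$$0 = f(x_t) + f'(x_t)(x - x_t) \quad\Longrightarrow\quad x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)},$$

which, with $f = g'$ and $f' = g''$, is once again the same update. The second derivative of $g$ enters only because it is the first derivative of $f = g'$; that is the whole reason Newton's method for minimization needs $g''$.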