Newton’s Method in Deep Learning (Goodfellow et al.)

machine-learning, multivariable-calculus

In Goodfellow et al.'s book Deep Learning, they cover Newton's method.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order: $$ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{T} \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{T} H(\theta - \theta_0) $$
If we then solve for the critical point of this function, we obtain the Newton parameter update rule: $$\theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$$ Note that $H$ is the Hessian matrix of $J$ with respect to $\theta$, evaluated at $\theta_0$.
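
For completeness, the intermediate step as I understand it: treating $H$ as symmetric, differentiating the quadratic approximation with respect to $\theta$ and setting the gradient to zero gives $$\nabla_{\theta}J(\theta_0) + H(\theta - \theta_0) = 0 \quad\Longrightarrow\quad \theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$$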

I have two questions,

  1. If applied iteratively, would the update rule essentially be unchanged if modified to $$\theta_{k+1} = \theta_{k} - H^{-1}\nabla_{\theta}J(\theta_k)$$?

  2. When going over the training algorithm associated with Newton's method, I noticed that they seem to ignore $\theta_{0}$ even though it is included as a required parameter to the algorithm. [Image: training algorithm associated with Newton's method from Deep Learning] I am wondering whether this was intentional or accidental, and, if accidental, at which point in the algorithm that parameter would be used.

Best Answer

  1. $\theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$ is the minimizer of the quadratic approximation to $J(\theta)$ given above (take the derivative with respect to $\theta$ and set it equal to zero). When used as an algorithm, the update rule is indeed the iterative version you give (see the sketch below the answers).
  2. Yes, they should say "initialize: $\theta \leftarrow \theta_0$" or something to that effect.
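
Below is a minimal numerical sketch of both points, using a made-up quadratic objective (none of the names or values come from the book): the loop applies the iterative rule from question 1, and the initialization line is where the required parameter $\theta_0$ from question 2 enters.

```python
import numpy as np

# Made-up quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta,
# chosen only so the gradient and Hessian have simple closed forms.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_J(theta):
    return A @ theta - b                  # gradient of J at theta

def hess_J(theta):
    return A                              # Hessian of J (constant for a quadratic)

def newton(theta0, num_steps=10):
    theta = theta0                        # initialize: theta <- theta_0 (question 2)
    for _ in range(num_steps):
        g = grad_J(theta)
        H = hess_J(theta)
        # iterative rule from question 1: theta_{k+1} = theta_k - H^{-1} grad J(theta_k)
        theta = theta - np.linalg.solve(H, g)
    return theta

print(newton(theta0=np.zeros(2)))         # for a quadratic this equals A^{-1} b after one step
```

Solving $Hx = g$ with `np.linalg.solve` rather than explicitly forming $H^{-1}$ is the usual numerical choice; for a quadratic objective the iterate lands on the exact minimizer after a single step, while a general $J$ would require re-evaluating the gradient and Hessian at each iterate exactly as the loop does.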