Newton’s Method in Deep Learning (Goodfellow et al.)

machine-learning, multivariable-calculus

In Goodfellow et al.'s book Deep Learning, they cover Newton's method.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order: $$ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{T} \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{T} H(\theta - \theta_0) $$
If we then solve for the critical point of this function, we obtain the Newton parameter update rule: $$\theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$$ Note that $H$ is the Hessian matrix of $J$ with respect to $\theta$, evaluated at $\theta_0$.
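
For completeness, the intermediate step as I understand it: treating $H$ as symmetric, differentiating the quadratic approximation with respect to $\theta$ and setting the gradient to zero gives $$\nabla_{\theta}J(\theta_0) + H(\theta - \theta_0) = 0 \quad\Longrightarrow\quad \theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$$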

I have two questions,

  1. If applied iteratively, would the update rule essentially be unchanged if modified to $$\theta_{k+1} = \theta_{k} - H^{-1}\nabla_{\theta}J(\theta_k)$$?

  2. When going over the training algorithm associated with Newton's method, I noticed that they seem to ignore $\theta_{0}$ even though it is included as a required parameter to the algorithm. [Image: training algorithm associated with Newton's method from Deep Learning] I am wondering whether this was intentional or accidental, and, if accidental, at which point in the algorithm that parameter would be used.

Best Answer

  1. $\theta^* = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0)$ is the minimizer of the quadratic approximation to $J(\theta)$ given above (take the derivative with respect to $\theta$ and set it equal to zero). When used as an algorithm, the update rule is indeed the iterative version you give (see the sketch below the answers).
  2. Yes, they should say "initialize: $\theta \leftarrow \theta_0$" or something to that effect.
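
Below is a minimal numerical sketch of both points, using a made-up quadratic objective (none of the names or values come from the book): the loop applies the iterative rule from question 1, and the initialization line is where the required parameter $\theta_0$ from question 2 enters.

```python
import numpy as np

# Made-up quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta,
# chosen only so the gradient and Hessian have simple closed forms.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_J(theta):
    return A @ theta - b                  # gradient of J at theta

def hess_J(theta):
    return A                              # Hessian of J (constant for a quadratic)

def newton(theta0, num_steps=10):
    theta = theta0                        # initialize: theta <- theta_0 (question 2)
    for _ in range(num_steps):
        g = grad_J(theta)
        H = hess_J(theta)
        # iterative rule from question 1: theta_{k+1} = theta_k - H^{-1} grad J(theta_k)
        theta = theta - np.linalg.solve(H, g)
    return theta

print(newton(theta0=np.zeros(2)))         # for a quadratic this equals A^{-1} b after one step
```

Solving $Hx = g$ with `np.linalg.solve` rather than explicitly forming $H^{-1}$ is the usual numerical choice; for a quadratic objective the iterate lands on the exact minimizer after a single step, while a general $J$ would require re-evaluating the gradient and Hessian at each iterate exactly as the loop does.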