Solved – Second order approximation of the loss function (Deep learning book, 7.33)

deep-learning, derivative, loss-functions, neural-networks

In the Deep Learning book (Goodfellow et al., 2016), the authors discuss the equivalence between early stopping and L2 regularisation (https://www.deeplearningbook.org/contents/regularization.html, page 247).

The quadratic approximation of the cost function $J$ is given by:

$$\hat{J}(\theta)=J(w^*)+\frac{1}{2}(w-w^*)^TH(w-w^*)$$

where $H$ is the Hessian matrix (Eq. 7.33). Isn't this missing the middle term? A second-order Taylor expansion should be:
$$f(w+\epsilon)=f(w)+f'(w)\cdot\epsilon+\frac{1}{2}f''(w)\cdot\epsilon^2$$
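In vector form (my own expansion, using the book's notation), I would therefore expect

$$\hat{J}(\theta)=J(w^*)+(w-w^*)^T\nabla_w J(w^*)+\frac{1}{2}(w-w^*)^TH(w-w^*),$$

with a first-order term $(w-w^*)^T\nabla_w J(w^*)$ that Eq. 7.33 does not show.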

Best Answer

The book takes the expansion around the weights at the optimum:

We can model the cost function $J$ with a quadratic approximation in the neighborhood of the empirically optimal value of the weights $w^*$

At that point the gradient is zero, i.e. $\nabla_w J(w^*)=0$ because $w^*$ is a minimum of $J$, so the first-order (middle) term vanishes and is simply omitted.
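As a sanity check, here is a minimal numerical sketch (not from the book; the least-squares cost, the random data, and all variable names are illustrative) showing that the gradient term contributes nothing at the empirical minimiser $w^*$:

```python
import numpy as np

# Minimal sketch, assuming a toy least-squares cost J(w) = 1/2 ||Xw - y||^2.
# For this cost the second-order expansion is exact, which makes it easy to
# see that only the gradient term distinguishes the two approximations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def J(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    return X.T @ (X @ w - y)

H = X.T @ X                              # Hessian (constant for least squares)
w_star = np.linalg.solve(H, X.T @ y)     # empirically optimal weights w*

print(np.linalg.norm(grad(w_star)))      # essentially zero: gradient vanishes at w*

w = w_star + 0.01 * rng.normal(size=3)   # a point near w*
d = w - w_star

approx_book = J(w_star) + 0.5 * d @ H @ d        # Eq. 7.33, no first-order term
approx_full = approx_book + grad(w_star) @ d     # with the "middle" term added back

print(J(w), approx_book, approx_full)    # all three values coincide
```

Because $\nabla_w J(w^*)\approx 0$, `approx_book` and `approx_full` agree to floating-point precision, which is exactly why the book drops the first-order term.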