Solved – Second derivative of neural network cost function

derivative, hessian, neural networks, regularization

This question is closely related to my previous one (where I asked about the quadratic approximation of the cost function with the Hessian matrix and didn't get an answer), but I think I have an idea of the answer to that one.

The problem I'm facing now is that we need to take the second derivatives of the cost function (from backpropagation) with respect to $W^{(1)}$ and $W^{(2)}$ (the weights of the first and second layers of our neural network). Our cost function is defined this way:

$$J = \frac{1}{2}\sum_{i=0}^n (y - \hat y)^2$$

When taking the first derivatives with respect to $W^{(1)}$ and $W^{(2)}$, we get these formulas (I'm not using the $\delta$ notation here):

$$\frac{dJ}{dW^{(2)}} = -(y-\hat y)f^{'}(z^{(3)})a^{(2)}$$
$$\frac{dJ}{dW^{(1)}} = -(y-\hat y)f^{'}(z^{(3)})W^{(2)}f^{'}(z^{(2)})x$$
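To make the shapes concrete, here is a small JAX sketch I put together: a 2-3-1 network with a sigmoid activation $f$ and no biases. The layer sizes and the random data are arbitrary choices of mine, just to check the hand-derived gradients above against `jax.grad`:

```python
import jax
import jax.numpy as jnp

def f(z):                              # assumed activation (sigmoid)
    return 1.0 / (1.0 + jnp.exp(-z))

def fprime(z):                         # f'(z) for the sigmoid
    return f(z) * (1.0 - f(z))

def cost(W1, W2, x, y):
    z2 = x @ W1                        # first-layer pre-activation
    a2 = f(z2)
    z3 = a2 @ W2                       # second-layer pre-activation
    yhat = f(z3)
    return 0.5 * jnp.sum((y - yhat) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x  = jax.random.normal(k1, (5, 2))     # 5 samples, 2 inputs (arbitrary sizes)
y  = jax.random.normal(k2, (5, 1))
W1 = jax.random.normal(k3, (2, 3))
W2 = jax.random.normal(k4, (3, 1))

# Hand-derived gradients (the formulas above, written with the transposes
# the matrix shapes require):
z2 = x @ W1
a2 = f(z2)
z3 = a2 @ W2
yhat = f(z3)
delta3 = -(y - yhat) * fprime(z3)
dJdW2  = a2.T @ delta3
delta2 = (delta3 @ W2.T) * fprime(z2)
dJdW1  = x.T @ delta2

# Compare against automatic differentiation
gW1, gW2 = jax.grad(cost, argnums=(0, 1))(W1, W2, x, y)
print(jnp.allclose(dJdW1, gW1), jnp.allclose(dJdW2, gW2))   # True True
```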

As I understand it, to build the Hessian matrix we need to take the second derivatives $c = \frac{d^{2}J}{{dW^{(2)}}^{2}}$, $a = \frac{d^{2}J}{{dW^{(1)}}^{2}}$ and $b = \frac{d^{2}J}{dW^{(1)}dW^{(2)}}$.

These quantities should fill the Hessian matrix like this:

$$H =
\begin{bmatrix}
a & b \\
b & c \\
\end{bmatrix}
$$

How do we do that? That is, how do we take these second derivatives of the cost function with respect to the two weight matrices we have?

Thank you in advance; I would be very grateful for an answer, since I haven't found much information about Hessian matrices and neural networks online, and no explanation at all of how the Hessian of the cost function is actually constructed.

Best Answer

A second derivative is just the derivative of a derivative. You differentiate with respect to one weight, and then differentiate the resulting function again, with respect to the same weight or a different one.

$$\frac{d^2J}{dW^{(1)} dW^{(2)}} = \frac{d}{dW^{(1)}} \frac{dJ}{dW^{(2)}}= \frac{d}{dW^{(2)}} \frac{dJ}{dW^{(1)}}$$

The gradient is itself a function. For instance, if

$$g^{(1)} = \frac{dJ}{dW^{(1)}}$$ $$g^{(2)} = \frac{dJ}{dW^{(2)}}$$

then

$$\frac{d^2J}{dW^{(1)} dW^{(2)}} = \frac{dg^{(2)}}{dW^{(1)}} = \frac{dg^{(1)}}{dW^{(2)}}$$
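As a toy sketch (scalar weights $w_1, w_2$ and a made-up cost of my own, not your network), this is exactly what differentiating the gradient looks like with `jax.grad` applied twice. Note that the two mixed derivatives come out equal, which is why the Hessian is symmetric:

```python
import jax
import jax.numpy as jnp

def J(w1, w2):
    # stand-in scalar cost; any smooth J(w1, w2) works the same way
    yhat = jnp.tanh(w2 * jnp.tanh(w1 * 0.7))
    return 0.5 * (1.3 - yhat) ** 2

g1 = jax.grad(J, argnums=0)                 # dJ/dw1, itself a function
g2 = jax.grad(J, argnums=1)                 # dJ/dw2, itself a function

a = jax.grad(g1, argnums=0)(0.5, -1.2)      # d^2J/dw1^2
b = jax.grad(g2, argnums=0)(0.5, -1.2)      # d^2J/(dw1 dw2)
b_check = jax.grad(g1, argnums=1)(0.5, -1.2)  # same value, other order
c = jax.grad(g2, argnums=1)(0.5, -1.2)      # d^2J/dw2^2

H = jnp.array([[a, b],
               [b, c]])
print(H, jnp.allclose(b, b_check))          # mixed derivatives agree
```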

Updated:

It is difficult to derive the gradient and the Hessian by hand for complex networks. In practice, people use automatic differentiation: it applies the chain rule over the defined computational graph to compute the derivatives, so you don't need to work out the calculus yourself.
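For example, with JAX (one such autodiff tool; the layer sizes, tanh activation and random data below are placeholders of mine), flattening the weights into one vector lets `jax.hessian` return the whole Hessian as a single square matrix:

```python
import jax
import jax.numpy as jnp

def cost(theta, x, y):
    W1 = theta[:6].reshape(2, 3)       # first-layer weights
    W2 = theta[6:].reshape(3, 1)       # second-layer weights
    a2 = jnp.tanh(x @ W1)
    yhat = jnp.tanh(a2 @ W2)
    return 0.5 * jnp.sum((y - yhat) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
theta = jax.random.normal(k1, (9,))    # all 9 weights in one vector
x = jax.random.normal(k2, (5, 2))
y = jax.random.normal(k3, (5, 1))

H = jax.hessian(cost)(theta, x, y)     # (9, 9), symmetric
print(H.shape)
```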
