How to compute the derivative of the cross-entropy loss $H(P,Q)$ with respect to the weights $W$

machine-learning, statistics

I'm trying to understand the cross-entropy loss using the Iris dataset for binary classification, where y=1 denotes that the plant belongs to the Setosa class and y=0 denotes that it belongs to the Non-Setosa class.

Consider the features (a subset of the original attributes) of a given example $x=[1.9, 0.4]$, which belongs to the Setosa class, so y=1.

Adapting it from the Deep Learning textbook, I believe the cross-entropy loss can be defined as follows:

$${\displaystyle H(P,Q)=-\sum_{j=1}^2 P(x_j)\, \log Q(x_j)}$$

where j=1 denotes that the given plant belongs to the Setosa class and j=2 denotes Non-Setosa. $P(x_j)$ denotes the corresponding ground-truth label probability, which is 100% for j=1 and 0% for j=2 in this particular case.

The meaning stated above is illustrated by this table

\begin{array}{c|c|c|c}
\text{meaning} & a_j = Q(x_j) & j & y_j = P(x_j) \\
\hline
\text{Setosa} & 0.284958 & 1 & 100\% \\
\text{Non-Setosa} & 0.715042 & 2 & 0\%
\end{array}
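Read directly off the table, the loss for this example can already be evaluated; here is a minimal NumPy sketch (the values of $Q(x_j)$ are the ones listed above, computed further below):

```python
import numpy as np

# Ground-truth label probabilities P(x_j) and model outputs Q(x_j), taken from the table above
p = np.array([1.0, 0.0])            # Setosa (j=1), Non-Setosa (j=2)
q = np.array([0.284958, 0.715042])

# H(P, Q) = -sum_j P(x_j) * log Q(x_j); only the j=1 term survives because P(x_2) = 0
H = -np.sum(p * np.log(q))
print(H)  # ~1.2554 with the natural log
```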

where the computation of $a_j$ is given below.

Consider the example $x=[1.9, 0.4]$. I'm using $Q(x)$ to denote the output of a softmax regression model, representing the probability that the model assigns the given input to a particular class, but what does $P(x)$ represent?

$$
a_j = {\displaystyle \operatorname{softmax} {(z_j)}={\frac {e^{z_{j}}}{\sum _{i=1}^{2}e^{z_{i}}}}\ \ \ \ {\text{ for }}j=1,2}
$$

where

$$z_j=w_j^Tx+b_j \ \ {\text{ for }}j=1,2$$
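In code, a minimal sketch of these two equations might look like this (the helper names and the convention that column $j$ of $W$ holds the weights $w_{1j}, w_{2j}$ for class $j$ are my assumptions, chosen to match the indexing used below):

```python
import numpy as np

def logits(W, x, b):
    # z_j = w_j^T x + b_j, with column j of W holding the weights w_1j, w_2j for class j
    return W.T @ x + b

def softmax(z):
    # a_j = exp(z_j) / sum_i exp(z_i)
    e = np.exp(z - np.max(z))  # subtracting max(z) is a standard numerical-stability trick; the result is unchanged
    return e / e.sum()
```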


Assume all biases $b_j$ are 0.0, and the weights $W$ are initialized as

$$w_{11}=0.1, w_{21}=0.1, w_{12}=0.5, w_{22}=0.5$$

As stated at the beginning, $x_1=1.9$ and $x_2=0.4$, so the confidence with which the model predicts the given plant as Setosa is computed as follows:

$$z_1 = w_{11}x_1 + w_{21}x_2 + b_1 = 0.23$$

$$a_1 = \operatorname{softmax} {(z_1)} = 0.284958$$

$${\displaystyle
H(P,Q)=-\sum_{j=1}^2 P(x_j)\, \log Q(x_j) = -1.0 \times \log{(0.284958)}
}$$

The confidence with which the model predicts the given feature vector [1.9, 0.4] as Non-Setosa is computed as follows:

$$z_2 = w_{12}x_1 + w_{22}x_2 + b_2 = 1.15$$

$$a_2 = \operatorname{softmax} {(z_2)} = 0.715042$$

Since $P(x_2)=0$, the Non-Setosa term contributes nothing to the loss:

$$-P(x_2)\, \log Q(x_2) = -0.0 \times \log{(0.715042)} = 0$$
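Putting the numbers above together, here is a short check of $z$, $a$, and the loss under the stated assumptions (all biases 0.0 and the given initial weights):

```python
import numpy as np

x = np.array([1.9, 0.4])                 # features of the given example
W = np.array([[0.1, 0.5],                # W[i, j] = w_ij (row i = input feature, column j = class)
              [0.1, 0.5]])
b = np.zeros(2)                          # all biases assumed 0.0
p = np.array([1.0, 0.0])                 # one-hot label: Setosa

z = W.T @ x + b                          # [0.23, 1.15]
a = np.exp(z) / np.exp(z).sum()          # [0.284958, 0.715042]
H = -np.sum(p * np.log(a))               # only the Setosa term is non-zero: -log(0.284958) ~ 1.2554
print(z, a, H)
```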


I've managed to calculate the derivative of $a_j$ with respect to $z_j$

$$\frac{da_j}{dz_j} = a_j(1-a_j) \ \ \text{for } j=1,2$$
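This is the diagonal entry of the softmax Jacobian; a quick finite-difference check, using the $z$ values computed above, is consistent with it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.23, 1.15])
a = softmax(z)

# Perturb z_1 only and compare the numerical slope of a_1 against a_1 * (1 - a_1)
eps = 1e-6
z_eps = z.copy()
z_eps[0] += eps
numeric = (softmax(z_eps)[0] - a[0]) / eps
analytic = a[0] * (1 - a[0])
print(numeric, analytic)  # both ~0.2038
```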

How do I compute the derivative of the loss $H(P,Q)$ with respect to the weights $W$ so that I can use the gradient descent algorithm to update the weights?

Considering $j=1$, the derivative of $H(P,Q)$ would be

\begin{align} \frac{\partial H}{\partial W_{i,1}} &=-\sum_{j=1}^2 \frac{P_j(x)}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i \\ &=-\frac{1}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i (P_1(x)+ P_2(x)) \end{align}

Is my understanding correct?

Best Answer

$$H = -\sum_{j=1}^2 P_j(x) \log Q_j(x)=-\sum_{k=1}^2 P_k(x) \log Q_k(x)$$

The trick is to use the chain rule.

\begin{align} \frac{\partial H}{\partial W_{i,j}} &= -\frac{\partial}{\partial W_{i,j}} \left(\sum_{k=1}^2 P_k(x) \log Q_k(x) \right) \\ &= -\frac{\partial}{\partial W_{i,j}} \left(P_j(x) \log Q_j(x) \right) \\ &=-\frac{P_j(x)}{Q_j(x)} \cdot \frac{\partial Q_j(x)}{\partial W_{i,j}} \\ &= - \frac{P_j(x)}{a_j(x)} \cdot \frac{\partial a_j(x)}{\partial z_j(x)} \cdot \frac{\partial z_j(x)}{\partial W_{i,j}}\\ &=-P_j(x) (1-a_j(x)) x_i \end{align}

Now that we have the gradient, you can perform gradient descent.
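As a minimal sketch (the learning rate is an arbitrary choice for illustration), one gradient-descent update using the expression derived above, $\partial H/\partial W_{i,j} = -P_j(x)\,(1-a_j(x))\,x_i$, could look like this:

```python
import numpy as np

x = np.array([1.9, 0.4])
W = np.array([[0.1, 0.5],                # W[i, j] = w_ij
              [0.1, 0.5]])
b = np.zeros(2)
p = np.array([1.0, 0.0])                 # one-hot label: Setosa
lr = 0.1                                 # learning rate (arbitrary choice for illustration)

z = W.T @ x + b
a = np.exp(z) / np.exp(z).sum()

# dH/dW[i, j] = -P_j(x) * (1 - a_j(x)) * x_i, the expression derived above
grad_W = -np.outer(x, p * (1 - a))
W = W - lr * grad_W                      # one gradient-descent update
print(grad_W)
print(W)
```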