I'm currently reading the book Deep Learning (Goodfellow et al., 2016) and had a question about the calculation of a gradient in an example used to explain backpropagation. For anyone who's curious, this is from section 6.5.9: Differentiation outside the Deep Learning Community.
Suppose we have variables $p_1, p_2, … , p_n$ representing probabilities and variables $z_1, z_2, … , z_n$ representing unnormalized log probabilities. Suppose we define
$$q_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss $J = -\sum_i p_i \log{q_i}$. A human mathematician can observe that the derivative of $J$ with respect to $z_i$ takes a very simple form: $q_i - p_i$.
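As a sanity check on that claim, here is a small NumPy sketch (my own illustration, not from the book) that compares the stated closed form $q_i - p_i$ against a central finite-difference approximation of the gradient of $J$:

```python
import numpy as np

# Check the claimed gradient dJ/dz = q - p against finite differences.
rng = np.random.default_rng(0)
n = 5
z = rng.normal(size=n)            # unnormalized log probabilities
p = rng.random(n)
p /= p.sum()                      # target probabilities, summing to 1

def softmax(z):
    e = np.exp(z - z.max())       # shift by max for numerical stability
    return e / e.sum()

def J(z):
    # cross-entropy loss J = -sum_i p_i log q_i
    return -np.sum(p * np.log(softmax(z)))

analytic = softmax(z) - p         # the claimed simple form q - p

# Central finite differences: (J(z + eps*e_i) - J(z - eps*e_i)) / (2*eps)
eps = 1e-6
numeric = np.array([
    (J(z + eps * np.eye(n)[i]) - J(z - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])

print(np.max(np.abs(analytic - numeric)))  # tiny: the two forms agree
```

The two gradients match to within finite-difference error, which is reassuring even before working through the algebra.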
I don't know how this result was derived, and was hoping that someone could give me some tips or advice. What I have so far is
$$\log{q_i} = \log{e^{z_i}} - \log\left(\sum_j e^{z_j}\right)$$
$$
\begin{align}
p_i\log{q_i} & = p_i \log{e^{z_i}} - p_i \log\left(\sum_j e^{z_j}\right) \\
& = p_i z_i - p_i\log\left(\sum_j e^{z_j}\right)
\end{align}$$
If we take the derivative of the term $p_i z_i$ in $J$, I can understand that $d/dz_i\,(p_i z_i) = p_i$, but how do we differentiate the second term, which contains the logarithm of the summation?
Thank you.
Best Answer
Your derivation of $p_i\log q_i$ is fine. Based upon it we obtain for $J$:
\begin{align*} J&=-\sum_{j=1}^np_jz_j+\sum_{j=1}^np_j\log\left(\sum_{k=1}^ne^{z_k}\right)\\ &=-\sum_{j=1}^np_jz_j+\log\left(\sum_{k=1}^ne^{z_k}\right)\tag{1} \end{align*}
In the last line we use the fact that the probabilities sum to one: $\sum_{j=1}^n p_j = 1$.
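From (1), one more differentiation step recovers the claimed form: the derivative of the log-sum-exp term with respect to $z_i$ is exactly the softmax, so

$$\frac{\partial J}{\partial z_i} = -p_i + \frac{\partial}{\partial z_i}\log\left(\sum_{k=1}^n e^{z_k}\right) = -p_i + \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}} = q_i - p_i.$$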