Derivative of Softmax loss function (with temperature T)

derivatives, machine-learning

I am trying to calculate the derivative of the cross-entropy loss when the softmax layer has a temperature $T$. That is:
\begin{equation}
p_j = \frac{e^{o_j/T}}{\sum_k e^{o_k/T}}
\end{equation}
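
For concreteness, here is a minimal NumPy sketch of this tempered softmax (the function and variable names are only illustrative):

```python
import numpy as np

def softmax_with_temperature(o, T=1.0):
    """p_j = exp(o_j / T) / sum_k exp(o_k / T)."""
    scaled = np.asarray(o, dtype=float) / T
    scaled -= scaled.max()              # shift logits for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(o, T=1.0))   # peaked distribution
print(softmax_with_temperature(o, T=10.0))  # much flatter distribution
```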

This question was already answered here for $T=1$: Derivative of Softmax loss function

Now, what would the final derivative be in terms of $p_i$, $q_i$, and $T$? Please see the linked question for the notation.

Edit: Thanks to Alex for pointing out a typo

Best Answer

The usual derivation of the cross-entropy loss for softmax outputs assumes that the target values are one-hot encoded rather than a fully specified probability distribution, and that $T=1$; this is why it contains neither of the $1/T$ factors derived below.

The following is from this elegantly written article:

\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i}{} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j y_i) \\ & = - t_i + t_i y_i + \sum_{j \neq i}^C t_j y_i = - t_i + \sum_{j = 1}^C t_j y_i = -t_i + y_i \sum_{j = 1}^C t_j \\ & = y_i - t_i \end{split}

where $C$ is the number of output classes. The derivation above assumes neither $T \ne 1$ nor that the target distribution is itself a softmax output. To see what the gradient looks like once these two assumptions are added, let us first plug in $T \ne 1$, i.e. $y_j = \frac{e^{z_j/T}}{\sum_k e^{z_k/T}}$:

\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i}{} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} \frac{1}{T} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{1}{T} (-y_j y_i) \\ & = -\frac{1}{T} t_i + \frac{1}{T} t_i y_i + \frac{1}{T} \sum_{j \neq i}^C t_j y_i = - \frac{1}{T} t_i + \frac{1}{T} \sum_{j = 1}^C t_j y_i = -\frac{1}{T} t_i + \frac{1}{T} y_i \sum_{j = 1}^C t_j \\ & = \frac{1}{T} (y_i - t_i) \end{split}
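
As a quick sanity check of the $\frac{1}{T}(y_i - t_i)$ result, here is a small finite-difference comparison in NumPy (the function names and the randomly chosen logits and targets are only illustrative):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def cross_entropy(z, t, T):
    # xi = -sum_j t_j * log(y_j), with y = softmax(z / T)
    return -np.sum(t * np.log(softmax_T(z, T)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)                  # logits
t = softmax_T(rng.normal(size=5), 1.0)  # any fixed target distribution (sums to 1)
T = 3.0

analytic = (softmax_T(z, T) - t) / T    # (1/T) * (y_i - t_i)

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(5)[i], t, T)
     - cross_entropy(z - eps * np.eye(5)[i], t, T)) / (2 * eps)
    for i in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```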

The last part, in which the assumption that the targets are themselves soft is also injected into the derivation, is beautifully summarized in Section 2.1 of Hinton et al.'s 2015 paper 'Distilling the Knowledge in a Neural Network'. Rewriting it in the notation of the derivation above, with $v_i$ denoting the teacher's logits, we get:

\begin{split} \frac{\partial \xi}{\partial z_i} & = \frac{1}{T} (y_i - t_i) = \frac{1}{T} (\frac{e^{z_i/T}}{\sum_{d=1}^C e^{z_d/T}} - \frac{e^{v_i/T}}{\sum_{d=1}^C e^{v_d/T}}) \end{split}
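
In code, this gradient is just the difference of two tempered softmaxes scaled by $1/T$; a minimal sketch, assuming $z$ are the student's logits and $v$ the teacher's (both chosen at random here):

```python
import numpy as np

def softmax_T(x, T):
    e = np.exp((x - x.max()) / T)
    return e / e.sum()

def distillation_grad(z, v, T):
    # d(xi)/dz_i = (1/T) * (softmax(z/T)_i - softmax(v/T)_i)
    return (softmax_T(z, T) - softmax_T(v, T)) / T

rng = np.random.default_rng(1)
z = rng.normal(size=4)   # student logits
v = rng.normal(size=4)   # teacher logits

for T in (1.0, 2.0, 10.0):
    print(T, distillation_grad(z, v, T))
# Raising T flattens both distributions and scales the gradient by 1/T.
```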

If the temperature is high compared with the magnitude of the logits, we can approximate: \begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{T} (\frac{1 + z_i/T}{C + \sum_{d=1}^C z_d/T} - \frac{1 + v_i/T}{C + \sum_{d=1}^C v_d/T}) \end{split}

since $e^x \approx 1 + x$ when $x$ is small (the denominators are simply these approximations summed over the $C$ classes). If we now assume that the logits have been zero-meaned separately for each transfer case, so that $\sum_{d} z_d = \sum_{d} v_d = 0$, then the above equation simplifies to: \begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{CT^2} (z_i - v_i) \end{split}
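
Numerically, the quality of this approximation can be checked by comparing the exact gradient with $(z_i - v_i)/(CT^2)$ for zero-meaned logits as $T$ grows; a small sketch under those assumptions:

```python
import numpy as np

def softmax_T(x, T):
    e = np.exp((x - x.max()) / T)
    return e / e.sum()

rng = np.random.default_rng(2)
C = 6
z = rng.normal(size=C); z -= z.mean()   # zero-meaned student logits
v = rng.normal(size=C); v -= v.mean()   # zero-meaned teacher logits

for T in (1.0, 10.0, 100.0):
    exact = (softmax_T(z, T) - softmax_T(v, T)) / T
    approx = (z - v) / (C * T**2)
    rel_gap = np.max(np.abs(exact - approx)) / np.max(np.abs(exact))
    print(T, rel_gap)   # the gap shrinks as T grows past the logit magnitudes
```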

This is how we arrive at the $1/T^2$ term. Here the 'transfer set' refers to the dataset used to train the to-be-distilled student model, labelled with soft targets produced by the softmax outputs of the cumbersome teacher model(s).