Your derivatives $\large \frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error when you differentiate the loss function $L$ with respect to $o_i$.
We have the following (the step where you went wrong is highlighted in $\color{red}{red}$):
$$\frac{\partial L}{\partial o_i}=-\sum_ky_k\frac{\partial \log p_k}{\partial o_i}=-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\=-y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}})\\=-y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}})\\=-y_i+\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}\\=\color{blue}{p_i\left(\sum_ky_k\right)}-y_i=p_i-y_i$$ given that $\sum_ky_k=1$ from the slides (as $y$ is a vector with only one non-zero element, which is $1$).
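To double-check the result $\frac{\partial L}{\partial o_i} = p_i - y_i$, here is a minimal NumPy sketch (my own addition, not from the slides) comparing the analytic gradient against central finite differences:

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax."""
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    """Cross-entropy L = -sum_k y_k * log(p_k)."""
    return -np.sum(y * np.log(softmax(o)))

rng = np.random.default_rng(0)
o = rng.normal(size=5)           # arbitrary logits
y = np.zeros(5)
y[2] = 1.0                       # one-hot target, so sum_k y_k = 1

analytic = softmax(o) - y        # the result derived above: p - y

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(5)[i], y) - loss(o - eps * np.eye(5)[i], y)) / (2 * eps)
    for i in range(5)
])
print(np.allclose(analytic, numeric))  # True
```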
We have a softmax-based loss function component given by:
$$L_i=-\log\left(\frac{e^{f_{y_i}}}{\sum_{j=0}^ne^{f_j}}\right)$$
Where:
- $f$ is the vector of class scores obtained during classification, indexed in the exponents
- $y_i$ is the index of the correct label, where $y$ is a column vector of the correct labels for all training examples and $i$ is the example's index
The objective is to find: $$\frac{\partial L_i}{\partial f_k}$$
Let's break $L_i$ down into two separate expressions. First, the loss function component itself:
$$L_i=-\log(p_{y_i})$$
And second, the vector of normalized probabilities:
$$p_k=\frac{e^{f_{k}}}{\sum_{j=0}^ne^{f_j}}$$
Let's denote the sum in the denominator by:
$$\sigma=\sum_{j=0}^ne^{f_j}$$
For $k=y_i$, using the quotient rule:
$$\frac{\partial p_k}{\partial f_{y_i}} = \frac{e^{f_k}\sigma-e^{2f_k}}{\sigma^2}$$
For $k\neq y_i$, the numerator $e^{f_k}$ is treated as a constant during differentiation:
$$\frac{\partial p_k}{\partial f_{y_i}} = \frac{-e^{f_k}e^{f_{y_i}}}{\sigma^2}$$
Going further:
$$\frac{\partial L_i}{\partial p_{y_i}}=-\left(\frac {1}{p_{y_i}}\right)$$
Using the chain rule ($L_i$ depends on each $f_k$ only through $p_{y_i}$; note that for $k\neq y_i$ the derivative $\frac{\partial p_{y_i}}{\partial f_k}$ equals the expression for $\frac{\partial p_k}{\partial f_{y_i}}$ computed above, which is symmetric in $k$ and $y_i$):
$$\frac{\partial L_i}{\partial f_k}=\frac{\partial L_i}{\partial p_{y_i}}\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {1}{\frac{e^{f_{y_i}}}{\sigma}}\right)\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {\sigma}{{e^{f_{y_i}}}}\right)\frac{\partial p_{y_i}}{\partial f_k}$$
Considering the two cases for $k$ and $y_i$: for $k=y_i$, after simplification:
$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}-\sigma}{\sigma}=\frac{e^{f_k}}{\sigma}-1=p_k-1$$
And for $k\neq y_i$:
$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}}{\sigma}=p_k$$
These two equations can be combined using the Kronecker delta:
$$\frac{\partial L_i}{\partial f_k}=p_k-\delta_{ky_i}$$
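As a quick illustration (my own sketch, not part of the original derivation), this is how the combined formula looks in NumPy; the function name `softmax_loss_grad` is hypothetical:

```python
import numpy as np

def softmax_loss_grad(f, y_i):
    """Gradient of L_i = -log(p_{y_i}) w.r.t. the score vector f:
    dL_i/df_k = p_k - delta_{k, y_i}."""
    p = np.exp(f - f.max())      # shift scores for numerical stability
    p /= p.sum()
    grad = p.copy()
    grad[y_i] -= 1.0             # the Kronecker-delta term
    return grad

f = np.array([2.0, 1.0, 0.1])
print(softmax_loss_grad(f, y_i=0))   # [p_0 - 1, p_1, p_2]
```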
Best Answer
The cross-entropy loss for softmax outputs assumes that the target values are one-hot encoded rather than a fully defined probability distribution at $T=1$, which is why the usual derivation does not include the second $1/T$ term.
The following is from this elegantly written article:
$$\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j y_i) \\ & = - t_i + t_i y_i + \sum_{j \neq i}^C t_j y_i = - t_i + \sum_{j = 1}^C t_j y_i = -t_i + y_i \sum_{j = 1}^C t_j \\ & = y_i - t_i \end{split}$$
where $C$ is the number of output classes. The above derivation assumes neither the $T \ne 1$ condition nor that the target distribution is itself a softmax output. So, to find out what the gradient looks like once these two missing assumptions are added to the derivation, let's first see what we get when we plug in the $T \ne 1$ assumption:
$$\begin{split} \frac{\partial \xi}{\partial z_i} & = - \sum_{j=1}^C \frac{\partial t_j \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} \frac{1}{T} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{1}{T} (-y_j y_i) \\ & = -\frac{1}{T} t_i + \frac{1}{T} t_i y_i + \frac{1}{T} \sum_{j \neq i}^C t_j y_i = - \frac{1}{T} t_i + \frac{1}{T} \sum_{j = 1}^C t_j y_i = -\frac{1}{T} t_i + \frac{1}{T} y_i \sum_{j = 1}^C t_j \\ & = \frac{1}{T} (y_i - t_i) \end{split}$$
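Here is a small NumPy sketch (my own addition) verifying the $\frac{1}{T}(y_i - t_i)$ gradient against finite differences for fixed one-hot targets; all names are illustrative:

```python
import numpy as np

def softmax_T(z, T):
    """Temperature-T softmax: y_i = exp(z_i/T) / sum_d exp(z_d/T)."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def xent(z, t, T):
    """xi = -sum_j t_j * log(y_j) with temperature-T softmax outputs."""
    return -np.sum(t * np.log(softmax_T(z, T)))

rng = np.random.default_rng(1)
C, T = 4, 3.0
z = rng.normal(size=C)                # logits
t = np.zeros(C)
t[1] = 1.0                            # fixed targets with sum_j t_j = 1

analytic = (softmax_T(z, T) - t) / T  # (1/T)(y - t) from the derivation

eps = 1e-6
numeric = np.array([
    (xent(z + eps * np.eye(C)[i], t, T) - xent(z - eps * np.eye(C)[i], t, T)) / (2 * eps)
    for i in range(C)
])
print(np.allclose(analytic, numeric))  # True
```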
The last part, where the assumption that the targets are soft as well is also injected into the derivation, is beautifully summarized in Section 2.1 of Prof. Hinton's 2015 paper titled 'Distilling the Knowledge in a Neural Network'. Rewriting it in the context of the derivation given above, we get:
$$\begin{split} \frac{\partial \xi}{\partial z_i} & = \frac{1}{T} (y_i - t_i) = \frac{1}{T} \left(\frac{e^{z_i/T}}{\sum_{d=1}^C e^{z_d/T}} - \frac{e^{v_i/T}}{\sum_{d=1}^C e^{v_d/T}}\right) \end{split}$$
If the temperature is high compared with the magnitude of the logits, we can approximate:
$$\begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{T} \left(\frac{1 + z_i/T}{C + \sum_{d=1}^C z_d/T} - \frac{1 + v_i/T}{C + \sum_{d=1}^C v_d/T}\right) \end{split}$$
since we can approximate $e^x$ with $1 + x$ when $x$ is very small (the denominator terms are simply these approximations summed over all $C$ classes). If we now assume that the logits have been zero-meaned separately for each transfer case, so that $\sum_{d} z_d = \sum_{d} v_d = 0$, the above equation simplifies to:
$$\begin{split} \frac{\partial \xi}{\partial z_i} & \approx \frac{1}{CT^2} (z_i - v_i) \end{split}$$
This is where the $1 / T^2$ term comes from. Here, 'transfer set' refers to the dataset used to train the to-be-distilled student model, labelled using soft targets produced via the softmax outputs of the cumbersome teacher model(s).
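A brief numerical sanity check of this approximation (my own sketch, with illustrative names): for zero-meaned logits and a temperature much larger than the logit magnitudes, the exact gradient $\frac{1}{T}(y_i - t_i)$ and the approximation $\frac{1}{CT^2}(z_i - v_i)$ agree closely:

```python
import numpy as np

def softmax_T(x, T):
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

rng = np.random.default_rng(2)
C, T = 5, 100.0                        # temperature large vs. logit magnitude
z = rng.normal(size=C); z -= z.mean()  # zero-meaned student logits
v = rng.normal(size=C); v -= v.mean()  # zero-meaned teacher logits

exact = (softmax_T(z, T) - softmax_T(v, T)) / T  # (1/T)(y_i - t_i)
approx = (z - v) / (C * T**2)                    # (1/(C T^2))(z_i - v_i)

# Relative error is a small fraction at high T and shrinks as T grows.
print(np.max(np.abs(exact - approx)) / np.max(np.abs(approx)))
```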