Knowledge Distillation math proof

machine learning

In the paper "Distilling the Knowledge in a Neural Network" by Hinton et al., the soft-target loss of the student model is defined as the cross-entropy $C$ between the outputs of the teacher model and the student model.
Assume that $i$ is an integer, $i \in [1, N]$, where $N$ is the number of classes the models are trained to classify.
Section 2.1 of the paper reads as follows:

Each case in the transfer set contributes a cross-entropy gradient, $dC/dz_i$, with respect to each logit, $z_i$ of the distilled model.
If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$, this gradient is given by:

$$
\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) \tag{2}
$$

If the (softmax) temperature is high compared with the magnitude of the logits, we can approximate:

$$
\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3}
$$

If we now assume that the logits have been zero-meaned separately for each transfer case so that
$\sum_j z_j = \sum_j v_j = 0$ Eq. 3 simplifies to:

$$
\frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2} (z_i - v_i) \tag{4}
$$

So in the high temperature limit, distillation is equivalent to minimizing
$$
\frac{1}{2}(z_i - v_i)^2, \tag{5}
$$
provided the logits are zero-meaned separately for each transfer case.
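The gradient in Eq. (2) can be checked numerically by comparing it against a finite-difference gradient of the cross-entropy; a minimal sketch assuming NumPy, with random illustrative logits:

```python
import numpy as np

def softmax(x, T):
    """Softmax with temperature T."""
    e = np.exp(x / T)
    return e / e.sum()

rng = np.random.default_rng(0)
N, T = 5, 4.0
v = rng.normal(size=N)               # teacher ("cumbersome" model) logits
z = rng.normal(size=N)               # student (distilled model) logits
p, q = softmax(v, T), softmax(z, T)  # soft targets and student probabilities

def C(z):
    """Cross-entropy between teacher soft targets p and the student softmax."""
    return -np.sum(p * np.log(softmax(z, T)))

# Analytic gradient from Eq. (2) vs. central finite differences.
grad_analytic = (q - p) / T
eps = 1e-6
grad_numeric = np.array([
    (C(z + eps * np.eye(N)[i]) - C(z - eps * np.eye(N)[i])) / (2 * eps)
    for i in range(N)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```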

I believe this is a good paper, but it skips so many steps that it is hard for a beginner like me to follow.

I already managed to derive Eq. 2 using the cross-entropy, and my problems are with Eq. 3 and Eq. 5.
For Eq. 3, I tried to use $\lim_{T\to\infty}e^{z_i/T} = \lim_{T\to\infty}(1+z_i/T)=1$, but I'm not sure whether this is correct.
For Eq. 5, I just don't know how to derive the equation.

Best Answer

For equation $(3)$, they use the first-order Taylor approximation

$$e^{x}\approx 1+x,$$

which is accurate when $x$ is small. Here $x$ stands for $\frac{z_i}{T}$ and $\frac{v_i}{T}$; when $T$ is large compared with the magnitudes of the logits, these quantities are small and the approximation is good. Substituting $1+z_i/T$ for each $e^{z_i/T}$ (and likewise for $v_i$) in equation $(2)$ gives equation $(3)$ directly.
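A small numerical sketch of this approximation (assuming NumPy), showing the error shrink as the temperature grows:

```python
import numpy as np

# First-order Taylor expansion: e^x ≈ 1 + x, with error ≈ x^2 / 2 for small x.
# In Eq. (3), x is z_i / T (or v_i / T), so a large temperature T makes x small.
for T in [10.0, 100.0, 1000.0]:
    x = 1.0 / T                      # a logit of magnitude 1 at temperature T
    error = abs(np.exp(x) - (1 + x))
    assert error < x**2              # error is about x^2 / 2, well below x^2
```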

For equation $(5)$, differentiate the quadratic $\frac{1}{2}(z_i-v_i)^2$ with respect to $z_i$: the gradient is exactly $z_i-v_i$, which is equation $(4)$ up to the constant factor $\frac{1}{NT^2}$. So, in the high-temperature limit, gradient descent on the distillation loss takes the same steps (up to a constant scale) as gradient descent on the squared logit difference, and both attain their minimum when $z_i=v_i$, i.e. when equation $(4)$ is $0$.
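This gradient equivalence can also be checked numerically; a sketch assuming NumPy, with zero-meaned random logits and a large temperature:

```python
import numpy as np

def softmax(x, T):
    """Softmax with temperature T."""
    e = np.exp(x / T)
    return e / e.sum()

rng = np.random.default_rng(1)
N, T = 5, 1e4                            # high-temperature regime
v = rng.normal(size=N); v -= v.mean()    # zero-meaned teacher logits
z = rng.normal(size=N); z -= z.mean()    # zero-meaned student logits

grad_exact = (softmax(z, T) - softmax(v, T)) / T   # Eq. (2)
grad_mse = z - v                                   # d/dz_i of (1/2)(z_i - v_i)^2

# Eq. (4): for large T the exact gradient approaches (z_i - v_i) / (N T^2),
# i.e. the squared-error gradient scaled by the constant 1 / (N T^2).
assert np.allclose(grad_exact * N * T**2, grad_mse, atol=1e-3)
```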
