Solved – Vectorization of Cross Entropy Loss

machine learning, neural networks

I am dealing with a problem related to finding the gradient of the cross-entropy loss function w.r.t. the parameter $\theta$, where:

$CE(\theta) = -\sum\nolimits_{i}{y_{i}\log(\hat{y}_{i})}$

where $\hat{y}_{i} = \mathrm{softmax}(\theta_{i})$ and $\theta_{i}$ is the vector of inputs (logits) for example $i$.

Also, $y_{i}$ is a one-hot vector for the correct class and $\hat{y}_{i}$ is the predicted probability for each class from the softmax function.

Hence, for example, let's have $y_i = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$
and $\hat{y}_{i} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$
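
For concreteness, here is a minimal NumPy sketch of this setup. The logits `theta_i` are an assumption on my part, chosen as the log of the probabilities above so that the softmax reproduces them:

```python
import numpy as np

def softmax(theta):
    # exponentiate and normalize; subtracting the max is for numerical stability
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

# Hypothetical logits: log of the example probabilities, so softmax recovers them
y_hat_target = np.array([0.10, 0.20, 0.10, 0.40, 0.20])
theta_i = np.log(y_hat_target)

y_i = np.array([0.0, 0.0, 0.0, 1.0, 0.0])   # one-hot: correct class is the 4th entry
y_hat_i = softmax(theta_i)

print(np.round(y_hat_i, 2))                  # [0.1 0.2 0.1 0.4 0.2]
```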

The partial derivative is $\frac{\partial{CE(\theta)}}{\partial{\theta_{ik}}} = -(y_{ik} - \hat{y}_{ik}) = \hat{y}_{ik} - y_{ik}$.

From there, for each $i$ the individual partial gradient would be
$\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}\hat{y}_{i1} - y_{i1}\\\hat{y}_{i2} - y_{i2}\\\hat{y}_{i3} - y_{i3}\\\hat{y}_{i4} - y_{i4}\\\hat{y}_{i5} - y_{i5}\end{pmatrix}$

But this is not true, because the gradient should actually be $0$ in every row except the 4th, since $y_{i}$ is a one-hot vector. So the actual gradient should be
$\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}0\\0\\0\\\hat{y}_{i4} - y_{i4}\\0\end{pmatrix}$

And hence the gradients for all $i$ should be
$\frac{\partial{CE(\theta)}}{\partial{\theta}} = \left( \begin{array}{ccccc}
0 & 0 & 0 & \hat{y}_{14} - y_{14} & 0 \\
0 & 0 & \hat{y}_{23} - y_{23} & 0 & 0 \\
\vdots \\
0 & \hat{y}_{n2} - y_{n2} & 0 & 0 & 0 \end{array} \right)$

But this is not equal to $\hat{y} - y$, so we should not call the gradient of the cross-entropy function the vector difference between the predicted and the true values.

Can someone clarify this?

UPDATE: Fixed my derivation

$\theta = \left( \begin{array}{c}
\theta_{1} \\
\theta_{2} \\
\theta_{3} \\
\theta_{4} \\
\theta_{5} \\
\end{array} \right)$

$CE(\theta) = -\sum\nolimits_{i}{y_{i}\log(\hat{y}_{i})}$

where $\hat{y} = \mathrm{softmax}(\theta)$, i.e. $\hat{y}_{i} = \mathrm{softmax}(\theta)_{i}$, and $\theta$ is the vector of inputs (logits).

Also, $y$ is a one-hot vector for the correct class and $\hat{y}$ is the predicted probability for each class from the softmax function.

Since $y$ is one-hot with the $1$ at index $k$ (the correct class), the loss reduces to $CE(\theta) = -\log(\hat{y}_{k})$.

UPDATE: Removed the index from $y$ and $\hat{y}$
Hence, for example, let's have $y = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$
and $\hat{y} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$
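
Using these numbers, a quick NumPy check (a sketch, using nothing beyond the vectors above) confirms that the full sum $-\sum_{i} y_{i}\log(\hat{y}_{i})$ and the single term $-\log(\hat{y}_{k})$ agree:

```python
import numpy as np

y     = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.10, 0.20, 0.10, 0.40, 0.20])

k = int(np.argmax(y))                     # index of the correct class (here 3, the 4th entry)
full_sum    = -np.sum(y * np.log(y_hat))  # -sum_i y_i log(y_hat_i)
single_term = -np.log(y_hat[k])           # -log(y_hat_k)

print(full_sum, single_term)              # both ≈ 0.916
```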

UPDATE: Fixed: I was taking the derivative w.r.t. $\theta_{ik}$; it should be w.r.t. $\theta_{i}$ only.
The partial derivative is $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = -(y_{i} - \hat{y}_{i}) = \hat{y}_{i} - y_{i}$.

Stacking these over $i$, the gradient is
$\frac{\partial{CE(\theta)}}{\partial{\theta}} = \begin{pmatrix}\hat{y}_{1} - y_{1}\\\hat{y}_{2} - y_{2}\\\hat{y}_{3} - y_{3}\\\hat{y}_{4} - y_{4}\\\hat{y}_{5} - y_{5}\end{pmatrix}$
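
Plugging in the example numbers (a small NumPy sketch, nothing beyond the vectors above) gives the gradient componentwise; note that none of the entries is zero:

```python
import numpy as np

y     = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.10, 0.20, 0.10, 0.40, 0.20])

grad = y_hat - y
print(grad)   # [ 0.1  0.2  0.1 -0.6  0.2] -- only the correct class gets a negative entry
```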

This happens because $CE(\theta) = -y_{k}\log(\hat{y}_{k}) = -\log(\hat{y}_{k})$,
and $\log(\hat{y}_{k}) = \log(\mathrm{softmax}(\theta)_{k}) = \theta_{k} - \log\left(\sum\nolimits_{j}{\exp(\theta_{j})}\right)$.
Taking the partial derivative of $CE(\theta)$ w.r.t. $\theta_{i}$, we get:

$\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = -\left(\frac{\partial{\theta_{k}}}{\partial{\theta_{i}}} - \mathrm{softmax}(\theta)_{i}\right) = \hat{y}_{i} - \frac{\partial{\theta_{k}}}{\partial{\theta_{i}}}$
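
The log-softmax identity used here can be checked numerically; this is a small sketch with arbitrary logits (the specific `theta` values are an assumption, not from the post):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

theta = np.array([1.3, -0.2, 0.7, 2.1, 0.0])   # arbitrary logits for the check
k = 3                                          # pretend the correct class is the 4th one

lhs = np.log(softmax(theta)[k])
rhs = theta[k] - np.log(np.sum(np.exp(theta)))
print(np.isclose(lhs, rhs))   # True: log(softmax(theta)_k) = theta_k - log(sum_j exp(theta_j))
```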

MAIN STEP:
The fact that $\frac{\partial{\theta_{k}}}{\partial{\theta_{i}}} = 0$ for $i \neq k$ and $\frac{\partial{\theta_{k}}}{\partial{\theta_{i}}} = 1$ for $i = k$ (in other words, $\frac{\partial{\theta_{k}}}{\partial{\theta_{i}}} = y_{i}$, since $y$ is one-hot at $k$) makes the vector $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \hat{y} - y$, which completes the proof.
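
A finite-difference check makes this conclusion easy to trust; the sketch below (arbitrary logits, assumed for illustration) compares $\hat{y} - y$ against a central-difference approximation of the gradient:

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def ce(theta, y):
    # cross-entropy for a single one-hot target y
    return -np.sum(y * np.log(softmax(theta)))

theta = np.array([1.3, -0.2, 0.7, 2.1, 0.0])   # arbitrary logits (assumed)
y     = np.array([0.0, 0.0, 0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(theta) - y                   # the claimed gradient, y_hat - y

eps = 1e-6
numeric = np.array([
    (ce(theta + eps * np.eye(5)[i], y) - ce(theta - eps * np.eye(5)[i], y)) / (2 * eps)
    for i in range(5)
])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```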

Best Answer

No, the gradient should not be zero for the other components. If your prediction is $\hat{y}_{ij}$ for some $i,j$ and your observation is $y_{ij}=0$, then you predicted too much by $\hat{y}_{ij}$.
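
To see this point numerically, here is a sketch (the learning rate of 1 and the logits reconstructed from the example probabilities are both assumptions): one gradient-descent step on $\theta$ lowers every over-predicted class probability and raises the correct one.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

theta = np.log(np.array([0.10, 0.20, 0.10, 0.40, 0.20]))  # logits that reproduce the example
y     = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

grad = softmax(theta) - y          # [0.1, 0.2, 0.1, -0.6, 0.2]
theta_new = theta - 1.0 * grad     # one gradient step, learning rate 1 (assumed)

print(np.round(softmax(theta), 3))      # [0.1  0.2  0.1  0.4  0.2]
print(np.round(softmax(theta_new), 3))  # wrong-class probabilities fall, the correct one rises
```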
