Machine Learning – Derivative of Softmax Loss Function

derivatives, linear algebra, machine learning

I am trying to wrap my head around back-propagation in a neural network with a Softmax classifier, which uses the Softmax function:

\begin{equation}
p_j = \frac{e^{o_j}}{\sum_k e^{o_k}}
\end{equation}
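
For reference, here is a minimal NumPy sketch of this function as I use it (the max-shift is only the standard overflow guard and does not change $p$; the logits are made-up values):

```python
import numpy as np

def softmax(o):
    """p_j = exp(o_j) / sum_k exp(o_k), computed with a stability shift."""
    e = np.exp(o - np.max(o))  # shifting by max(o) leaves p unchanged
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])  # illustrative logits
p = softmax(o)
print(p, p.sum())              # probabilities, summing to 1
```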

This is used in a loss function of the form

\begin{equation}L = -\sum_j y_j \log p_j,\end{equation}

where $o$ is the vector of inputs to the Softmax (one entry per class). I need the derivative of $L$ with respect to $o$. Now, if my derivatives are right,

\begin{equation}
\frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j
\end{equation}

and

\begin{equation}
\frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j.
\end{equation}
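
Both cases can be written together as $\frac{\partial p_j}{\partial o_i} = p_j(\delta_{ij} - p_i)$, and a quick central-difference sketch (again with made-up values) agrees with these formulas:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])
p = softmax(o)

# Analytic Jacobian: J[j, i] = dp_j/do_i = p_j * (delta_ij - p_i)
jac_analytic = np.diag(p) - np.outer(p, p)

# Central-difference Jacobian, one input coordinate at a time
eps = 1e-6
jac_numeric = np.zeros((len(o), len(o)))
for i in range(len(o)):
    d = np.zeros_like(o)
    d[i] = eps
    jac_numeric[:, i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)

print(np.max(np.abs(jac_analytic - jac_numeric)))  # tiny, ~1e-11
```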

Using these derivatives, we obtain

\begin{eqnarray}
\frac{\partial L}{\partial o_i} &=& - \left (y_i (1 - p_i) + \sum_{k\neq i}-p_k y_k \right )\\
&=& p_i y_i - y_i + \sum_{k\neq i} p_k y_k\\
&=& \left (\sum_k p_k y_k \right ) - y_i
\end{eqnarray}

According to the slides I'm using, however, the result should be

\begin{equation}
\frac{\partial L}{\partial o_i} = p_i – y_i.
\end{equation}

Can someone please tell me where I'm going wrong?

Best Answer

Your derivatives $\large \frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error when you differentiate the loss function $L$ with respect to $o_i$.

We have the following, where I have highlighted in $\color{red}{\text{red}}$ the places where you went wrong:

\begin{eqnarray}
\frac{\partial L}{\partial o_i} &=& -\sum_k y_k\frac{\partial \log p_k}{\partial o_i} = -\sum_k y_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\
&=& -y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}(\color{red}{-p_k p_i})\\
&=& -y_i(1-p_i)+\sum_{k\neq i}y_k(\color{red}{p_i})\\
&=& -y_i+\color{blue}{y_i p_i+\sum_{k\neq i}y_k p_i}\\
&=& \color{blue}{p_i\left(\sum_k y_k\right)}-y_i = p_i-y_i,
\end{eqnarray}

given that $\sum_k y_k = 1$ from the slides (since $y$ is a one-hot vector: exactly one element equals $1$ and the rest are $0$).
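
As a quick numerical sketch (assuming a one-hot $y$ as above; the values are only illustrative), the claimed gradient $p - y$ matches central differences:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss(o, y):
    # L = -sum_j y_j log p_j
    return -np.sum(y * np.log(softmax(o)))

o = np.array([2.0, 1.0, 0.1])   # illustrative logits
y = np.array([0.0, 1.0, 0.0])   # one-hot target, so sum_k y_k = 1

grad_analytic = softmax(o) - y  # the claimed gradient p - y

eps = 1e-6
grad_numeric = np.zeros_like(o)
for i in range(len(o)):
    d = np.zeros_like(o)
    d[i] = eps
    grad_numeric[i] = (loss(o + d, y) - loss(o - d, y)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, ~1e-11
```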
