Can someone explain step by step how to find the derivative of this softmax loss function/equation?
\begin{equation}
L_i=-\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log\left(\sum_j e^{f_j}\right)
\end{equation}
where:
\begin{equation}
f_j = w_j \cdot x_i
\end{equation}
let:
\begin{equation}
p = \frac{e^{f_{y_{i}}}}{\sum_j e^{f_j}}
\end{equation}
The code I'm working from shows that the derivative of $L_i$ with respect to $w_j$, when $j = y_i$, is:
\begin{equation}
(p-1) * x_i
\end{equation}
and when $j \neq y_i$ the derivative (with $p$ here meaning $p_j$, the probability assigned to class $j$) is:
\begin{equation}
p * x_i
\end{equation}
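For context, here is a minimal sketch (my own illustration, not the actual assignment code; `softmax_loss_naive` is a hypothetical name) of how these two cases typically show up in a per-class gradient loop:

```python
import numpy as np

def softmax_loss_naive(W, x, y):
    """Softmax loss and gradient for a single example.

    W : (num_classes, dim) weights, so the scores are f_j = W[j] @ x
    x : (dim,) input vector
    y : index of the correct class (y_i in the question)
    """
    f = W @ x                        # class scores f_j = w_j . x_i
    f = f - f.max()                  # shift scores for numerical stability
    p = np.exp(f) / np.exp(f).sum()  # normalized probabilities
    loss = -np.log(p[y])
    dW = np.zeros_like(W)
    for j in range(W.shape[0]):
        if j == y:
            dW[j] = (p[j] - 1) * x   # j == y_i case: (p_{y_i} - 1) * x_i
        else:
            dW[j] = p[j] * x         # j != y_i case: p_j * x_i
    return loss, dW
```

Note that in the second branch the probability is $p_j$ for class $j$, not the probability of the correct class.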
It seems related to this post, where the OP says that the derivative of:
\begin{equation}
p_j = \frac{e^{o_j}}{\sum_k e^{o_k}}
\end{equation}
is:
\begin{equation}
\frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j
\end{equation}
But I couldn't figure it out. I'm used to taking derivatives with respect to variables, but not with respect to indexed components.
Best Answer
We have a softmax-based loss function component given by: $$L_i=-\log\left(\frac{e^{f_{y_i}}}{\sum_{j=0}^ne^{f_j}}\right)$$
where $y_i$ is the index of the correct class and $f_j$ are the class scores defined in the question. The objective is to find: $$\frac{\partial L_i}{\partial f_k}$$
Let's break $L_i$ into two pieces: the loss in terms of the probability of the correct class:
$$L_i=-\log(p_{y_i})$$
and the vector of normalized probabilities:
$$p_k=\frac{e^{f_{k}}}{\sum_{j=0}^ne^{f_j}}$$
Let's denote the normalizing sum by:
$$\sigma=\sum_{j=0}^ne^{f_j}$$
Since $L_i$ depends only on $p_{y_i}$, the derivative we need is $\partial p_{y_i}/\partial f_k$. For $k={y_i}$, the quotient rule gives:
$$\frac{\partial p_{y_i}}{\partial f_k} = \frac{e^{f_k}\sigma-e^{2f_k}}{\sigma^2}$$
For $k\neq{y_i}$, the numerator $e^{f_{y_i}}$ is treated as a constant (only $\sigma$ depends on $f_k$):
$$\frac{\partial p_{y_i}}{\partial f_k} = \frac{-e^{f_{y_i}}e^{f_k}}{\sigma^2}$$
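The two quotient-rule results are just the entries of the softmax Jacobian, and can be sanity-checked numerically against central differences (a quick sketch with arbitrary scores, assuming NumPy):

```python
import numpy as np

f = np.array([2.0, 1.0, 0.1])  # arbitrary score vector
sigma = np.exp(f).sum()

def softmax(f):
    return np.exp(f) / np.exp(f).sum()

# Analytic Jacobian J[k, m] = dp_k / df_m from the two cases above:
#   off-diagonal: -e^{f_k} e^{f_m} / sigma^2
#   diagonal:     (e^{f_k} sigma - e^{2 f_k}) / sigma^2
J = -np.outer(np.exp(f), np.exp(f)) / sigma**2
J[np.diag_indices_from(J)] = (np.exp(f) * sigma - np.exp(2 * f)) / sigma**2

# Central-difference approximation of the same Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for m in range(3):
    fp, fm = f.copy(), f.copy()
    fp[m] += eps
    fm[m] -= eps
    J_num[:, m] = (softmax(fp) - softmax(fm)) / (2 * eps)
```

The diagonal entries simplify to $p_k(1-p_k)$, which is exactly the formula quoted from the linked post.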
Going further:
$$\frac{\partial L_i}{\partial p_{y_i}}=-\frac{1}{p_{y_i}}$$
Using the chain rule:
$$\frac{\partial L_i}{\partial f_k}=-\left(\frac {1}{\frac{e^{f_{y_i}}}{\sigma}}\right)\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {\sigma}{{e^{f_{y_i}}}}\right)\frac{\partial p_{y_i}}{\partial f_k}$$
Considering the two cases for $k$ and $y_i$: for $k=y_i$, after simplification:
$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}-\sigma}{\sigma}=\frac{e^{f_k}}{\sigma}-1=p_k-1$$
And for $k\neq y_i$:
$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}}{\sigma}=p_k$$
These two cases can be combined using the Kronecker delta:
$$\frac{\partial L_i}{\partial f_k}=p_k-\delta_{ky_i}$$
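As a final sanity check, the combined formula agrees with a numerical derivative of $L_i$ (a small NumPy sketch of my own; the scores and class index are arbitrary):

```python
import numpy as np

f = np.array([0.5, 2.0, -1.0])       # arbitrary scores f_k
y_i = 1                              # index of the correct class

p = np.exp(f) / np.exp(f).sum()
grad = p - np.eye(len(f))[y_i]       # p_k - delta_{k, y_i}

def loss(f):
    # L_i = -log(p_{y_i})
    return -np.log(np.exp(f[y_i]) / np.exp(f).sum())

# Central-difference derivative of L_i with respect to each f_k
eps = 1e-6
grad_num = np.zeros_like(f)
for k in range(len(f)):
    fp, fm = f.copy(), f.copy()
    fp[k] += eps
    fm[k] -= eps
    grad_num[k] = (loss(fp) - loss(fm)) / (2 * eps)
```

The gradient entries sum to zero (since $\sum_k p_k = 1$), and only the correct-class entry is negative, which matches the $(p-1)\,x_i$ versus $p_j\,x_i$ split in the question's code.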