[Math] How to Derive the Softmax Loss Gradient

derivatives, linear algebra, machine learning

Can someone explain, step by step, how to find the derivative of this softmax loss function/equation?

\begin{equation}
L_i=-\log\left(\frac{e^{f_{y_{i}}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log\left(\sum_j e^{f_j}\right)
\end{equation}

where:
\begin{equation}
f_j = w_j \cdot x_i
\end{equation}
let:

\begin{equation}
p = \frac{e^{f_{y_{i}}}}{\sum_j e^{f_j}}
\end{equation}

The code shows that the derivative of $L_i$ with respect to $w_j$ when $j = y_i$ is:

\begin{equation}
(p-1)\,x_i
\end{equation}

and when $j \neq y_i$ the derivative is:

\begin{equation}
p\,x_i
\end{equation}
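For reference, here is a minimal NumPy sketch of the kind of per-example computation I believe the code is doing; the names `W`, `x_i`, and `y_i` are my own placeholders (one row of `W` per class), not taken from the original code:

```python
import numpy as np

def softmax_loss_grad(W, x_i, y_i):
    """Loss L_i and gradient dW for a single training example (illustrative sketch)."""
    f = W.dot(x_i)                    # class scores, f_j = w_j . x_i
    f = f - f.max()                   # shift scores for numerical stability
    p = np.exp(f) / np.exp(f).sum()   # normalized probabilities
    loss = -np.log(p[y_i])
    dW = np.outer(p, x_i)             # row j gets p_j * x_i ...
    dW[y_i] -= x_i                    # ... and row y_i gets (p_{y_i} - 1) * x_i
    return loss, dW
```

Row $j$ of `dW` is then the gradient with respect to $w_j$, matching the two cases above.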

It seems related to this post, where the OP says the derivative of:

\begin{equation}
p_j = \frac{e^{o_j}}{\sum_k e^{o_k}}
\end{equation}

is:

\begin{equation}
\frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j
\end{equation}

But I couldn't figure it out. I'm used to taking derivatives with respect to variables, but not with respect to indexed components like these.
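For what it's worth, a quick finite-difference check in NumPy does confirm that Jacobian (with $-p_i p_j$ in the off-diagonal case); the score vector here is an arbitrary made-up example:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())           # numerically stable softmax
    return e / e.sum()

o = np.array([1.0, 2.0, 0.5])         # arbitrary scores
p = softmax(o)
eps = 1e-6

# Finite-difference Jacobian: J[j, i] approximates dp_j / do_i
J = np.empty((o.size, o.size))
for i in range(o.size):
    d = np.zeros_like(o)
    d[i] = eps
    J[:, i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)

# Analytic form: p_j(1 - p_j) on the diagonal, -p_j p_i off-diagonal
J_analytic = np.diag(p) - np.outer(p, p)
print(np.allclose(J, J_analytic, atol=1e-8))   # True
```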

Best Answer

We have a softmax-based loss function component given by: $$L_i=-\log\left(\frac{e^{f_{y_i}}}{\sum_{j=0}^ne^{f_j}}\right)$$

Where:

  1. $f$ is the vector of class scores obtained during classification
  2. $y_i$ is the index of the correct label for example $i$, where $y$ is the column vector of correct labels over all training examples

Objective is to find: $$\frac{\partial L_i}{\partial f_k}$$

Let's break $L_i$ down into two separate expressions. First, the loss component itself: $$L_i=-\log(p_{y_i})$$

And second, the vector of normalized probabilities:

$$p_k=\frac{e^{f_{k}}}{\sum_{j=0}^ne^{f_j}}$$

For brevity, let's write the sum as:

$$\sigma=\sum_{j=0}^ne^{f_j}$$

$L_i$ depends on the scores only through $p_{y_i}$, so the quantity we need is $\frac{\partial p_{y_i}}{\partial f_k}$. For $k={y_i}$, using the quotient rule:

$$\frac{\partial p_{y_i}}{\partial f_{k}} = \frac{e^{f_{y_i}}\sigma-e^{2f_{y_i}}}{\sigma^2}$$

For $k\neq{y_i}$, the numerator $e^{f_{y_i}}$ is treated as a constant during differentiation, so only the sum in the denominator contributes:

$$\frac{\partial p_{y_i}}{\partial f_{k}} = \frac{-e^{f_{y_i}}e^{f_k}}{\sigma^2}$$
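As a sanity check, both partial derivatives can be verified numerically; a minimal NumPy sketch, with a made-up score vector and class index:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

f = np.array([0.5, 1.5, -0.3])        # made-up scores
y_i = 1                                # made-up correct-class index
sigma = np.exp(f).sum()
eps = 1e-6

for k in range(f.size):
    d = np.zeros_like(f)
    d[k] = eps
    # central-difference estimate of dp_{y_i} / df_k
    numeric = (softmax(f + d)[y_i] - softmax(f - d)[y_i]) / (2 * eps)
    if k == y_i:
        analytic = (np.exp(f[k]) * sigma - np.exp(2 * f[k])) / sigma**2
    else:
        analytic = -np.exp(f[y_i]) * np.exp(f[k]) / sigma**2
    print(k, np.isclose(numeric, analytic, atol=1e-7))   # True for every k
```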

Next, differentiating the loss with respect to $p_{y_i}$:

$$\frac{\partial L_i}{\partial p_{y_i}}=-\frac{1}{p_{y_i}}$$

Applying the chain rule:

$$\frac{\partial L_i}{\partial f_k}=\frac{\partial L_i}{\partial p_{y_i}}\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {1}{\frac{e^{f_{y_i}}}{\sigma}}\right)\frac{\partial p_{y_i}}{\partial f_{k}}=-\left(\frac {\sigma}{{e^{f_{y_i}}}}\right)\frac{\partial p_{y_i}}{\partial f_{k}}$$

For $k=y_i$, after simplification:

$$\frac{\partial L_i}{\partial f_k}=-\frac{\sigma}{e^{f_{y_i}}}\cdot\frac{e^{f_{y_i}}\sigma-e^{2f_{y_i}}}{\sigma^2}=\frac{e^{f_{y_i}}-\sigma}{\sigma}=\frac{e^{f_{y_i}}}{\sigma}-1=p_{y_i}-1$$

And for $k\neq y_i$:

$$\frac{\partial L_i}{\partial f_k}=-\frac{\sigma}{e^{f_{y_i}}}\cdot\frac{-e^{f_{y_i}}e^{f_k}}{\sigma^2}=\frac{e^{f_k}}{\sigma}=p_k$$

These two cases can be combined using the Kronecker delta:

$$\frac{\partial L_i}{\partial f_k}=p_k-\delta_{ky_i}$$
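This final formula is easy to verify with a finite-difference gradient check; a minimal NumPy sketch, again with made-up scores:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def loss(f, y_i):
    return -np.log(softmax(f)[y_i])

f = np.array([0.3, -1.2, 2.0, 0.7])   # made-up score vector
y_i = 2                                # made-up correct-class index

# Analytic gradient: dL_i/df_k = p_k - delta_{k, y_i}
grad = softmax(f)
grad[y_i] -= 1.0

# Central-difference check of the loss
eps = 1e-6
numeric = np.empty_like(f)
for k in range(f.size):
    d = np.zeros_like(f)
    d[k] = eps
    numeric[k] = (loss(f + d, y_i) - loss(f - d, y_i)) / (2 * eps)

print(np.allclose(grad, numeric, atol=1e-7))   # True
```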
