Solved – How to find derivative of softmax function for the purpose of gradient descent

Tags: backpropagation, derivative, neural networks, softmax

I'm trying to understand the backpropagation algorithm for multiclass classification using gradient descent. I'm working from https://www.cs.toronto.edu/~graves/phd.pdf . The output layer is a softmax layer, in which each unit has the activation function:

$$ y_k = \frac{e^{a_k}}{\sum_{k'=1}^{K} e^{a_{k'}}} $$

Here, $a_k$ is the sum of inputs to unit $k$.

Differentiating the above equation, the author arrives at this result:

$$ \frac{\partial y_{k'}}{\partial a_k} = y_{k'}\left(\delta_{kk'} - y_k\right) $$

I'm confused by the $\delta_{kk'}$ term; I have never seen anything like it before.

Another question: do we need to consider the summation while taking the derivative, and why or why not?

https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function is a bit relevant, but the result of differentiation is different.

Best Answer

As whuber points out, $\delta_{ij}$ is the Kronecker delta (https://en.wikipedia.org/wiki/Kronecker_delta):

$$ \delta_{ij} = \begin{cases} 0 &\text{when } i \ne j \\ 1 &\text{when } i = j \end{cases} $$

... and remember that a softmax has a vector of inputs and gives a vector of outputs, where the input and output vectors have the same length.

Each of the values in the output vector will change if any of the input vector values changes. So each output vector value is a function of all the input vector values:

$$ y_{k'} = f_{k'}(a_1, a_2, a_3,\dots, a_K) $$

where $k'$ is the index into the output vector, the vectors are of length $K$, and $f_{k'}$ is some function. So, the input vector is length $K$, the output vector is length $K$, and both $k$ and $k'$ take values in $\{1,2,3,\dots,K\}$.
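To make this "every output depends on every input" behaviour concrete, here is a minimal NumPy sketch (the function name and example values are my own illustration, not from the thesis or the question): perturbing a single input changes all of the outputs, because the shared denominator changes.

```python
import numpy as np

def softmax(a):
    """y_k = exp(a_k) / sum_{k'} exp(a_{k'}) for a vector of pre-activations a."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability; the result is unchanged
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
print(softmax(a))                                 # approx [0.090, 0.245, 0.665]

# Nudging only the first input changes all three outputs, because the
# shared denominator sum_k exp(a_k) changes too.
print(softmax(a + np.array([0.5, 0.0, 0.0])))     # approx [0.140, 0.231, 0.629]
```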

When we differentiate $y_{k'}$, we differentiate partially with respect to each of the input vector values. So we will have:

  • $\frac{\partial y_{k'}}{\partial a_1}$
  • $\frac{\partial y_{k'}}{\partial a_2}$
  • etc ...

Rather than calculating individually for each of $a_1$, $a_2$, etc., we'll just use $k$ to represent the index $1, 2, 3, \dots$; i.e. we will calculate:

$$ \frac{\partial y_{k'}}{\partial a_k} $$

...where:

  • $k \in \{1,2,3,\dots,K\}$ and
  • $k' \in \{1,2,3,\dots,K\}$

When we do this differentiation (e.g. see https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/), the derivative will be:

$$ \frac{\partial y_{k'}}{\partial a_k} = \begin{cases} y_k(1 - y_{k'}) &\text{when }k = k'\\ - y_k y_{k'} &\text{when }k \ne k' \end{cases} $$

We can then write this using the Kronecker delta, purely as a notational convenience, to avoid having to write out the 'cases' statement each time:

$$ \frac{\partial y_{k'}}{\partial a_k} = y_{k'}\left(\delta_{kk'} - y_k\right) $$

which is the result quoted in the question.
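As an illustration (not from the thesis), here is a small NumPy sketch that builds the full Jacobian from this compact formula, $\operatorname{diag}(y) - y\,y^\top$, and checks it against a finite-difference approximation; the function names and test values are my own.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """J[k', k] = dy_{k'}/da_k = y_{k'} * (delta_{kk'} - y_k), i.e. diag(y) - y y^T."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

# Finite-difference check of the analytic Jacobian.
a = np.array([0.5, -1.0, 2.0, 0.0])
eps = 1e-6
numeric = np.zeros((a.size, a.size))
for k in range(a.size):
    step = np.zeros_like(a)
    step[k] = eps
    # Central difference approximates the k-th column of the Jacobian.
    numeric[:, k] = (softmax(a + step) - softmax(a - step)) / (2 * eps)

print(np.allclose(softmax_jacobian(a), numeric, atol=1e-8))   # True
```

The diagonal entries ($k = k'$) are $y_k(1 - y_k)$ and the off-diagonal entries are $-y_k y_{k'}$, exactly the two cases written out above.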