Solved – Derivative of Softmax with respect to weights

derivative, machine learning, neural networks, self-study, softmax

I'm new to deep learning and am attempting to calculate the derivative of the following function with respect to the matrix $\mathbf w$:

$$p(a) = \frac{e^{w_a^\top x}}{\sum_{d} e^{w_d^\top x}}$$

Using the quotient rule, I get:
$$\frac{\partial p(a)}{\partial w} = \frac{xe^{w_a^\top x}\sum_{d} e^{w_d^\top x} - e^{w_a^\top x}\sum_{d} xe^{w_d^\top x}}{\left[\sum_{d} e^{w_d^\top x}\right]^2} = 0$$

I believe I'm doing something wrong, since the softmax function is commonly used as an activation function in deep learning (and thus cannot always have a derivative of $0$). I've gone over similar questions, but they seem to gloss over this part of the calculation.

I'd appreciate any pointers towards the right direction.

Best Answer

The last hidden layer produces output values forming a vector $\vec x = \mathbf x$. The output layer is meant to classify among $k = 1,\dots,K$ categories with a softmax activation function assigning conditional probabilities (given $\mathbf x$) to each one of the $K$ categories. In each node of the final (or output) layer the pre-activation (logit) values consist of the scalar products $\mathbf{w}_j^\top\mathbf{x}$, where $\mathbf w_j\in\{\mathbf{w}_1, \mathbf{w}_2,\dots,\mathbf{w}_K\}$. In other words, each category $k$ has its own vector of weights pointing at it, determining the contribution of each element in the output of the previous layer (including a bias), encapsulated in $\mathbf x$. However, the activation of this final layer does not take place element-wise (as it would, for example, with a sigmoid function in each neuron), but rather through the application of a softmax function, which maps a vector in $\mathbb R^K$ to a vector of $K$ elements in $[0,1]$ that sum to $1$. Here is a made-up NN to classify colors:

[Figure: a made-up neural network with a softmax output layer classifying colors]

Define the softmax as

$$ \sigma(j)=\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}=\frac{\exp(z_j)}{\sum_{k=1}^K \exp(z_k)}$$
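Purely as a sanity check of this definition, here is a minimal NumPy sketch of the forward pass (the function name `softmax`, the max-shift for numerical stability, and the random example sizes are my own illustrative choices, not part of the question): it builds the logits $z_j = \mathbf w_j^\top \mathbf x$ and maps them to probabilities.

```python
import numpy as np

def softmax(z):
    """Map a logit vector z in R^K to probabilities in [0, 1] that sum to 1."""
    z = z - np.max(z)           # shift by the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # output of the last hidden layer (may include a bias term)
W = rng.normal(size=(3, 4))     # row j is the weight vector w_j for category j
z = W @ x                       # logits z_j = w_j^T x
p = softmax(z)
print(p, p.sum())               # K = 3 probabilities summing to 1
```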


We want the partial derivative with respect to a weight vector $\mathbf w_i$, but it is easier to first get the derivative of $\sigma(j)$ with respect to the logit $z_i = \mathbf w_i^\top \mathbf x$:

$$\begin{align} \frac{\partial}{\partial( \mathbf{w}_i^\top \mathbf x)}\sigma(j) &= \frac{\partial}{\partial \left(\mathbf{w}_i^\top \mathbf x\right)}\;\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)} \\[2ex] &\underset{*}{=} \frac{\frac{\partial}{\partial (\mathbf{w}_i^\top \mathbf x)}\,\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{\left(\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x) \right)^2}\;\frac{\partial}{\partial \left(\mathbf w_i^\top\mathbf x\right)}\,\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)\\[2ex] &= \frac{\delta_{ij}\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{ \sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)}\, \frac{\exp(\mathbf{w}_i^\top\mathbf x)}{\sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)} \\[3ex] &=\sigma(j)\left(\delta_{ij}-\sigma(i)\right) \end{align}$$

$*$: quotient rule
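Before moving on, the result $\partial\sigma(j)/\partial z_i = \sigma(j)\left(\delta_{ij}-\sigma(i)\right)$ can be checked numerically. The sketch below (again assuming NumPy; the names `J_analytic`/`J_numeric` and the random logits are illustrative) compares the analytic Jacobian $\operatorname{diag}(\sigma)-\sigma\sigma^\top$, whose $(j,i)$ entry is exactly $\sigma(j)(\delta_{ij}-\sigma(i))$, with central finite differences over the logits:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=3)
s = softmax(z)

# Analytic Jacobian: entry [j, i] = sigma(j) * (delta_ij - sigma(i))
J_analytic = np.diag(s) - np.outer(s, s)

# Central finite differences over each logit z_i
eps = 1e-6
J_numeric = np.zeros((3, 3))
for i in range(3):
    dz = np.zeros(3)
    dz[i] = eps
    J_numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # tiny, so the two agree
```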


Thanks and (+1) to Yuntai Kyong for pointing out that a prior version of this post had a forgotten index, and that the dependence of the softmax denominator on every $\mathbf w_k$ had been left out of the chain rule that follows.

By the chain rule,

$$\begin{align}\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)&= \sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\sigma(j)\quad \frac{\partial}{\partial\mathbf{w}_i}\mathbf{w}_k^\top \mathbf{x}\\[2ex] &=\sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\;\sigma(j)\quad \delta_{ik} \mathbf{x}\\[2ex] &=\sum_{k = 1}^K\sigma(j)\left(\delta_{kj}-\sigma(k)\right)\quad \delta_{ik} \mathbf{x} \end{align}$$

The Kronecker delta $\delta_{ik}$ leaves only the $k=i$ term of the sum, so:

$$\bbox[8px, border: 2px solid lime]{\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)=\sigma(j)\left(\delta_{ij}-\sigma(i)\right)\mathbf x}$$
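If you want to verify the boxed formula on a concrete example, here is a sketch (NumPy again; the chosen indices $j=1$, $i=2$ and the random $\mathbf x$, $\mathbf W$ are arbitrary illustrations) that compares it with finite differences over the components of $\mathbf w_i$:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
K, D = 3, 4
x = rng.normal(size=D)
W = rng.normal(size=(K, D))     # row i is the weight vector w_i

j, i = 1, 2                     # pick an output sigma(j) and a weight vector w_i
s = softmax(W @ x)

# Boxed result: d sigma(j) / d w_i = sigma(j) * (delta_ij - sigma(i)) * x
grad_analytic = s[j] * ((1.0 if i == j else 0.0) - s[i]) * x

# Central finite differences over each component of w_i
eps = 1e-6
grad_numeric = np.zeros(D)
for d in range(D):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, d] += eps
    Wm[i, d] -= eps
    grad_numeric[d] = (softmax(Wp @ x)[j] - softmax(Wm @ x)[j]) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, so the formula checks out
```

In particular, the gradient is not identically zero: the cancellation in the question comes from differentiating the numerator and denominator with respect to a single generic $w$ instead of tracking which weight vector $\mathbf w_i$ each exponential actually depends on.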