Solved – Derivative of softmax and squared error

derivative, gradient descent, least squares, neural networks, softmax

I'm trying to understand the derivatives w.r.t. the softmax arguments when used in conjunction with a squared loss (for example as the last layer of a neural network).


I am using the following notation:

Softmax argument for class $i$:

$a_i=b_i + W_i \mathbf{h}$

Vector of softmax arguments:

$\mathbf{a} = (a_1,\dots,a_K)^T$

Softmax probability for class $i$:

$p_i(\mathbf{a}) = \frac{e^{a_i}}{\sum_k{e^{a_k}}}$

$\mathbf{p(a)} = (p_1(\mathbf{a}),\dots,p_K(\mathbf{a}))^T$

Vector with true labels:

$\mathbf{y} = (0,\dots,0,1,0,\dots,0)^T$

Squared loss:

$L(\mathbf{p(a)},\mathbf{y}) = \sum_i (p_i(\mathbf{a}) - y_i)^2$


Now, I want to get the derivative of the loss w.r.t. a particular softmax argument $a_j$.
This is what I arrived at:

$\frac{d}{da_j} L(\mathbf{p(a)},\mathbf{y})$

$= \sum_i \frac{d}{da_j} [p_i^2(\mathbf{a}) -2y_ip_i(\mathbf{a}) + y_i^2] $

$= \sum_i \frac{d}{dp_i} [p_i^2(\mathbf{a}) -2y_ip_i(\mathbf{a})] \frac{dp_i}{da_j} $

$= 2 \sum_i (p_i(\mathbf{a}) - y_i)\, p_i(\mathbf{a})\, (1_{i=j} - p_j(\mathbf{a}))$
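(The last step uses the softmax Jacobian $\frac{dp_i}{da_j} = p_i(\mathbf{a})\,(1_{i=j} - p_j(\mathbf{a}))$.) To convince myself, here is a minimal NumPy sketch comparing this expression against a finite-difference approximation of the gradient; the helper names and toy values are my own:

```python
import numpy as np

def softmax(a):
    # numerically stable softmax: p_i = exp(a_i) / sum_k exp(a_k)
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(a, y):
    # squared loss L(p(a), y) = sum_i (p_i(a) - y_i)^2
    return np.sum((softmax(a) - y) ** 2)

def grad_analytic(a, y):
    # dL/da_j = 2 * sum_i (p_i - y_i) * p_i * (1_{i=j} - p_j)
    p = softmax(a)
    J = np.diag(p) - np.outer(p, p)  # Jacobian: dp_i/da_j = p_i (1_{i=j} - p_j)
    return 2.0 * J @ (p - y)         # J is symmetric, so this is the sum over i above

# toy example with K = 4 classes, true class 2
a = np.array([0.5, -1.0, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])

# central finite-difference check of the analytic gradient
eps = 1e-6
grad_fd = np.array([
    (loss(a + eps * np.eye(4)[j], y) - loss(a - eps * np.eye(4)[j], y)) / (2 * eps)
    for j in range(4)
])

print(grad_analytic(a, y))
print(grad_fd)  # should agree with the analytic gradient up to numerical error
```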


Looking at the draft of the Bengio et al. Neural Networks book (Chapter 6, http://www.iro.umontreal.ca/~bengioy/DLbook/), they state this equation:

$\frac{d}{da_j} L(\mathbf{p(a)},\mathbf{y}) = 2\, (\mathbf{p(a)} - \mathbf{y}) \odot \mathbf{p} \odot (1 - \mathbf{p})$

Where $\odot$ is element-wise multiplication.
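For what it's worth, the two expressions can be put side by side numerically. Another minimal NumPy sketch (again with arbitrary toy values of my own):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])
p = softmax(a)

# my derivation: full sum over i, using dp_i/da_j = p_i (1_{i=j} - p_j)
J = np.diag(p) - np.outer(p, p)
grad_full = 2.0 * J @ (p - y)

# element-wise formula quoted from the draft chapter
grad_elementwise = 2.0 * (p - y) * p * (1.0 - p)

print(grad_full)
print(grad_elementwise)  # keeps only the i = j terms of the sum; the
                         # off-diagonal terms do not vanish for a softmax
```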

How should I interpret this?
Is there a mistake in my derivation?

Best Answer

Yes, your formula is correct. The formula in the draft chapter was for the sigmoid, not for the softmax. We will fix it. Thanks for pointing it out.

-- Yoshua Bengio
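To spell out the sigmoid case: with element-wise outputs $p_i = \sigma(a_i)$ the Jacobian is diagonal, $\frac{dp_i}{da_j} = p_i(1 - p_i)\,1_{i=j}$, so the same chain rule as in the question collapses the sum over $i$ to the $i = j$ term:

$\frac{d}{da_j} L(\mathbf{p(a)},\mathbf{y}) = 2\,(p_j(\mathbf{a}) - y_j)\, p_j(\mathbf{a})\,(1 - p_j(\mathbf{a}))$

which in vector form is exactly the element-wise expression quoted from the draft.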
