Solved – Matrix Representation of Softmax Derivatives in Backpropagation


I have a simple multilayer fully connected neural network for classification. At the last layer I have used a softmax activation function, so I have to propagate the error through the softmax layer. Suppose I have 3 softmax units at the output layer. The input to these 3 units can be described by the logit vector
$z =\begin{pmatrix}z_1\\z_2\\z_3\end{pmatrix}$. Now let's say those 3 units output $y = \begin{pmatrix}y_1\\y_2\\y_3\end{pmatrix}$. I want to calculate $\frac{\partial y}{\partial z}$, which is simply
$$
\frac{\partial }{\partial z} \operatorname{softmax}(z)
$$
I know the derivatives of the softmax function are $\frac{\partial y_i}{\partial z_j} = y_i(\delta_{ij}-y_j)$, where $\delta_{ij}$ is the Kronecker delta. I can break this expression down into two matrices (maybe this is where I am going wrong):
$$\texttt{matrix_a} =\begin{bmatrix}y_1(1-y_1) & 0 & 0 \\0 & y_2(1-y_2) & 0\\0 &0 & y_3(1-y_3)\end{bmatrix}$$ and
$$\texttt{matrix_b} =\begin{bmatrix}0 & -y_1y_2 & -y_1y_3 \\-y_1y_2 & 0 & -y_2y_3\\-y_1y_3 &-y_2y_3 & 0\end{bmatrix}$$
Finally, I add these matrices to get the following matrix:
$$\texttt{matrix_c} =\begin{bmatrix}y_1(1-y_1) & -y_1y_2 & -y_1y_3 \\-y_1y_2 & y_2(1-y_2) & -y_2y_3\\-y_1y_3 &-y_2y_3 & y_3(1-y_3)\end{bmatrix}$$
Now if I take the sum over the rows I should get the column matrix of $\frac{\partial y}{\partial z}$. So the final column matrix containing the derivatives for $z$ is:
$$\texttt{matrix} =\begin{pmatrix}y_1(1-y_1)-y_1y_2-y_1y_3 \\-y_1y_2+y_2(1-y_2)-y_2y_3\\-y_1y_3-y_2y_3+y_3(1-y_3)\end{pmatrix}$$
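For concreteness, here is a quick NumPy sketch (with arbitrary example values for $y$) of the matrices above and their row sums:

```python
import numpy as np

y = np.array([0.2, 0.5, 0.3])                  # arbitrary example softmax outputs
matrix_a = np.diag(y * (1 - y))                # diagonal terms y_i(1 - y_i)
matrix_b = np.diag(y * y) - np.outer(y, y)     # off-diagonal terms -y_i y_j
matrix_c = matrix_a + matrix_b

print(matrix_c.sum(axis=1))                    # row sums come out as ~[0, 0, 0]
```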
But this is definitely wrong: since $y_1+y_2+y_3 = 1$, I get a derivative of 0 for each softmax unit. Can you please tell me where I am going wrong and how I can make it correct? Thanks for reading.

Best Answer

The final matrix already is the matrix of derivatives $\frac{\partial y}{\partial z}$. Element $(i, j)$ of the matrix corresponds to the single derivative $\frac{\partial y_i}{\partial z_j}$.

One usually expects to compute gradients for the backpropagation algorithm, but a gradient is only defined for a scalar. Here $y$ is a vector, so we stack the gradients of its components (each component of $y$ is a scalar) to form the Jacobian matrix:

$$ \frac{\partial y}{\partial z} = \begin{bmatrix} \frac{\partial y_1}{\partial z} \\ \frac{\partial y_2}{\partial z} \\ \frac{\partial y_3}{\partial z} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial z_1} & \frac{\partial y_1}{\partial z_2} & \frac{\partial y_1}{\partial z_3} \\ \frac{\partial y_2}{\partial z_1} & \frac{\partial y_2}{\partial z_2} & \frac{\partial y_2}{\partial z_3} \\ \frac{\partial y_3}{\partial z_1} & \frac{\partial y_3}{\partial z_2} & \frac{\partial y_3}{\partial z_3} \end{bmatrix} = \begin{bmatrix}y_1(1-y_1) & -y_1y_2 & -y_1y_3 \\-y_1y_2 & y_2(1-y_2) & -y_2y_3\\-y_1y_3 &-y_2y_3 & y_3(1-y_3)\end{bmatrix} $$
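If it helps to see this numerically, here is a minimal NumPy sketch (the logits are arbitrary example values) that builds this Jacobian as $\operatorname{diag}(y) - yy^\top$ and checks it against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(y):
    # dy_i/dz_j = y_i * (delta_ij - y_j), i.e. diag(y) - y y^T
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])        # arbitrary example logits
y = softmax(z)
J = softmax_jacobian(y)

# Central finite-difference check: column j approximates dy/dz_j
eps = 1e-6
J_num = np.stack(
    [(softmax(z + eps * np.eye(3)[j]) - softmax(z - eps * np.eye(3)[j])) / (2 * eps)
     for j in range(3)],
    axis=1,
)
assert np.allclose(J, J_num, atol=1e-6)
```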

Note, however, that in backpropagation we usually compute the gradient of some loss function $L$. Since it is a scalar, we can compute its gradient with respect to $z$:

$$ \frac{\partial L}{\partial z} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial z} $$

The first factor $\frac{\partial L}{\partial y}$ is a gradient (i.e. a vector) computed in the previous step of backpropagation; its form depends on the loss function (e.g. cross-entropy or MSE). The second factor is the Jacobian matrix shown above. By multiplying the vector $\frac{\partial L}{\partial y}$ by the matrix $\frac{\partial y}{\partial z}$ we get another vector $\frac{\partial L}{\partial z}$, which is suitable for the next backpropagation step.
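As a sketch of this step (assuming, for example, a cross-entropy loss $L = -\sum_i t_i \log y_i$ with a one-hot target $t$, so that $\frac{\partial L}{\partial y} = -t/y$; a different loss would only change that vector):

```python
import numpy as np

z = np.array([1.0, 2.0, 0.5])                 # arbitrary example logits
y = np.exp(z - z.max()); y /= y.sum()         # softmax output
J = np.diag(y) - np.outer(y, y)               # Jacobian dy/dz from above

t = np.array([0.0, 1.0, 0.0])                 # hypothetical one-hot target
dL_dy = -t / y                                # gradient from the previous backprop step

dL_dz = dL_dy @ J                             # vector times Jacobian
print(dL_dz)                                  # for this loss it equals y - t
```

For this particular loss the product simplifies to the well-known $y - t$, which is a handy way to sanity-check the Jacobian.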

Note that the formula for $\frac{\partial L}{\partial z}$ might be a little difficult to derive directly in the vectorized form shown above. It is easier to start by computing the derivative for a single element and to vectorize the equation later:

$$ \frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial z_j} $$

The above equation for a single element of $\frac{\partial L}{\partial z}$ is equivalent to the vectorized form above; one can convince oneself of this by comparing it with how vector-matrix multiplication works.
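A small numerical check (using a hypothetical upstream gradient $\frac{\partial L}{\partial y}$) that the element-wise sum agrees with the vector-matrix product:

```python
import numpy as np

z = np.array([1.0, 2.0, 0.5])                 # arbitrary example logits
y = np.exp(z - z.max()); y /= y.sum()         # softmax output
J = np.diag(y) - np.outer(y, y)               # Jacobian dy/dz
dL_dy = np.array([0.2, -0.5, 0.3])            # hypothetical upstream gradient dL/dy

# Element-wise form: dL/dz_j = sum_i dL/dy_i * dy_i/dz_j
dL_dz_elementwise = np.array(
    [sum(dL_dy[i] * J[i, j] for i in range(3)) for j in range(3)]
)

# It matches the vectorized vector-matrix product
assert np.allclose(dL_dz_elementwise, dL_dy @ J)
```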