Derivative of row-wise softmax matrix w.r.t. matrix itself

derivatives, matrix-calculus, scalar-fields

Define matrix $\mathbf{M}' \in \mathbb{R}^{n \times k}$ as the result of the row-wise softmax operation on matrix $\mathbf{M} \in \mathbb{R}^{n \times k}$. Hence,

$$\mathbf{M}'_{ij} = \frac{\exp\left(\mathbf{M}_{ij}\right)}{\sum\limits_{b=1}^k \exp\left(\mathbf{M}_{ib}\right)}.$$
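In code, that row-wise softmax is a one-liner; a minimal NumPy sketch (the subtraction of the row maximum is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def row_softmax(M):
    """Row-wise softmax: each row of the result sums to 1.
    Shifting by the row max avoids overflow in exp()."""
    Z = np.exp(M - M.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

M = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
Mp = row_softmax(M)
print(Mp.sum(axis=1))  # each row sums to 1
```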

Now, I look at the derivative of a scalar function, e.g., the Frobenius norm, with respect to $\textbf{M}$, namely

$$
\frac{\partial E}{\partial \textbf{M}} = \frac{\partial \left\Vert \textbf{X} - \textbf{M}'\textbf{H}\right\Vert_F}{\partial \textbf{M}}.
$$

I don't have any problem calculating the derivative of the above function w.r.t. $\textbf{M}'$. However, I am interested in the derivative w.r.t. $\textbf{M}$, which means I somehow have to deal with the row-wise softmax operation. Since softmax is a vector function, but I want the derivative w.r.t. the whole matrix $\textbf{M}$ at once, I don't know how best to handle it. Do I need to compute the derivative w.r.t. each row vector $\textbf{M}_{i:}$ separately? Also, the derivative of the softmax of a single row would yield a Jacobian matrix of dimensionality $k \times k$; getting one such Jacobian for each row $\textbf{M}_{i:}$ seems to mess up the dimensionality, assuming I would need to concatenate all those Jacobians… I am not sure where my mistake is, but I feel stuck.

It would be great if you could help me out 🙂

Thanks in advance and best regards.

Best Answer

Denote elementwise/Hadamard multiplication and division by the symbols $(\odot,\oslash)$ respectively.

Define an all-ones matrix $J\in{\mathbb R}^{k\times k}$, as well as the following matrices $$\eqalign{ P &= \exp(M) \quad&\implies dP &= P\odot dM \\ Q &= PJ &\implies dQ &= dP\,J = (P\odot dM)\,J \\ R &= P\oslash Q &\implies dR &= dP\oslash Q - P\odot dQ\oslash Q\oslash Q \\ &&&= R\odot dM - S\odot\Big((P\odot dM)\,J\Big) \\ S &= R\oslash Q \\ Y &= RH-X \\ }$$ where the differentials of all the new matrices have been expressed in terms of that of $M$.
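A quick numerical sanity check (NumPy; the variable names follow the answer's $P, Q, R, S$) that $R = P\oslash Q$ reproduces the row-wise softmax, since multiplying by the all-ones $J$ replicates each row sum across the columns:

```python
import numpy as np

n, k = 4, 3
rng = np.random.default_rng(0)
M = rng.normal(size=(n, k))

J = np.ones((k, k))   # all-ones k x k matrix
P = np.exp(M)         # P = exp(M), elementwise
Q = P @ J             # Q_ij = sum_b exp(M_ib), same value in every column j
R = P / Q             # R = P ./ Q, i.e. the row-wise softmax
S = R / Q

# R matches the direct row-wise softmax definition
softmax = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)
print(np.allclose(R, softmax))  # True
```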
Also note that in your naming convention $$\eqalign{ R &= M' \\ Q_{ij} &= \sum_{b=1}^k \Big(\exp M_{ib}\Big) J_{bj}\\ }$$

Define the scalar function ${\cal E}$ in terms of the new matrices and calculate its gradient; the division by $2{\cal E}$ is absorbed into the matrix $A$, defined below. $$\eqalign{ {\cal E}^2 &= \|Y\|_F^2 \;=\; Y:Y \\ 2{\cal E}\,d{\cal E} &= 2Y:dY \\ &= 2(RH-X):dR\,H \\ &= 2(RH-X)H^T:dR \\ d{\cal E} &= A:dR \\ &= A:\bigg(R\odot dM - S\odot\Big((P\odot dM)\,J\Big)\bigg) \\ &= (R\odot A):dM - P\odot\Big((S\odot A)J\Big):dM \\ \frac{\partial{\cal E}}{\partial M} &= R\odot A - P\odot\Big((S\odot A)J\Big) \\ \\ }$$
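The final gradient formula can be verified against central finite differences; a sketch in NumPy (the shapes of $X$ and $H$ here are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 4, 3, 5
M = rng.normal(size=(n, k))
H = rng.normal(size=(k, m))
X = rng.normal(size=(n, m))
J = np.ones((k, k))

def E(M):
    """E = ||X - M' H||_F with M' the row-wise softmax of M."""
    R = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)
    return np.linalg.norm(X - R @ H)

# analytic gradient from the derivation above
P = np.exp(M)
Q = P @ J
R = P / Q
S = R / Q
A = (R @ H @ H.T - X @ H.T) / np.linalg.norm(X - R @ H)
grad = R * A - P * ((S * A) @ J)

# central finite-difference gradient, entry by entry
num = np.zeros_like(M)
eps = 1e-6
for i in range(n):
    for j in range(k):
        Mp = M.copy(); Mp[i, j] += eps
        Mm = M.copy(); Mm[i, j] -= eps
        num[i, j] = (E(Mp) - E(Mm)) / (2 * eps)

print(np.max(np.abs(grad - num)))  # tiny, on the order of floating-point noise
```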


In the above steps, the implicit definitions $$\eqalign{ A &= \left(\frac{RHH^T-XH^T}{\|X-RH\|_F}\right) \\ A:B &= {\rm Tr}(A^TB) \\ }$$ were utilized; the latter being the trace/Frobenius product.
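Numerically, the trace/Frobenius product is just an elementwise multiply-and-sum, which is how it is typically evaluated in code:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

frob = np.trace(A.T @ B)   # A:B = Tr(A^T B)
elemwise = np.sum(A * B)   # equal, but cheaper: no k x k intermediate
print(np.isclose(frob, elemwise))  # True
```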