Finding cross entropy gradient in matrix form

Tags: differential, machine-learning, matrix-calculus

I know how to get the cross entropy gradient of a single weight, but that's boring. I want to find the gradient w.r.t. the whole weight matrix, but I'm not very confident in my results. I would really appreciate it if anyone could check my work here and point me in the right direction. To start, define some variables

  • $X \in \mathbb{R}^{n \times m}$, an input matrix
  • $\beta \in \mathbb{R}^{m \times k}$, a weight matrix
  • $Y \in \mathbb{R}^{n \times k}$, a one-hot target matrix
  • $J \in \mathbb{R}^{k \times n}$, a matrix of ones

And then some functions

  • $P:\mathbb{R}^{n \times m} \rightarrow \mathbb{R}^{n \times k}$, where $P(X) = X\beta$
  • $S: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}^{n \times k}$, the softmax function applied row-wise
  • $L: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}^{n \times k}$, take the logarithm of each element, $L(A)_{i, j} = \log(A_{i,j})$
  • $Z: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}^{n \times k}$, where $Z(A) = Y \odot A$
  • $E: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}$, defined by $E(A) = -tr(JA)$

Then cross entropy loss can be written as

$$\text{cross entropy} = E\Bigg[Z\bigg[L\bigg(S\big(P(X)\big) \bigg) \bigg] \Bigg] $$
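For concreteness, a minimal NumPy sketch of this composition (the sizes $n, m, k$, the seed, and the random data are just placeholders for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 3                       # assumed sizes, for illustration only

X = rng.normal(size=(n, m))             # input matrix
beta = rng.normal(size=(m, k))          # weight matrix
Y = np.eye(k)[rng.integers(0, k, n)]    # one-hot target matrix, n x k
J = np.ones((k, n))                     # matrix of ones, k x n

P = X @ beta                                            # P(X) = X beta
S = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)    # row-wise softmax
L = np.log(S)                                           # element-wise log
Z = Y * L                                               # Y (Hadamard) L
cross = -np.trace(J @ Z)                                # E(Z) = -tr(JZ)

# same number as the usual sum formulation of cross entropy
assert np.isclose(cross, -(Y * np.log(S)).sum())
```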

Now I'm going to abuse notation and start using $P$ to represent $P(X) = X\beta$, $S$ to represent $S(P(X))$, etc. The chain rule tells us that if the function $F$ is differentiable at the point $C$ and the function $G$ is differentiable at the point $B = F(C)$, then the composite

$$H(X) = G(F(X)) $$

is differentiable at $C$, and

$$ DH(C) = \big(DG(B)\big)\big(DF(C) \big)$$

where $DF(C)$ denotes the Jacobian of $F$ at the point $C$. Using this, I should be able to write the Jacobian of cross entropy as

$$D\text{cross} = \big(DE(Z)\big)\big(DZ(L) \big)\big(DL(S)\big)\big(DS(P)\big)\big(DP(\beta)\big) $$

Now I just need to take differentials and find the Jacobians.

First,
$$d\operatorname{vec} E = -d\operatorname{tr}(JZ)$$
$$= -\operatorname{tr}(J\,dZ)$$
$$= -\big(\operatorname{vec} J\big)^T d\operatorname{vec} Z$$
$$\Rightarrow DE(Z) = -\big(\operatorname{vec} J\big)^T$$
(Strictly, $\operatorname{tr}(J\,dZ) = \big(\operatorname{vec} J^T\big)^T\operatorname{vec} dZ$, but since $J$ is all ones, $\operatorname{vec} J^T$ and $\operatorname{vec} J$ are both just the ones vector of length $nk$.)
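A quick numerical check of this step (arbitrary sizes and data; `vec` below is the column-stacking vec):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3
J = np.ones((k, n))
dZ = rng.normal(size=(n, k))

vec = lambda A: A.reshape(-1, order="F")   # column-stacking vec

lhs = -np.trace(J @ dZ)
rhs = -(vec(J) @ vec(dZ))                  # -(vec J)^T vec dZ; vec J is all ones
assert np.isclose(lhs, rhs)
```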
Second,
$$d\operatorname{vec} Z = \operatorname{vec} d\big(Y \odot L\big)$$
$$= \operatorname{vec}\big(Y \odot dL\big)$$
$$= \operatorname{Diag}\big[\operatorname{vec} Y\big]\, d\operatorname{vec} L$$
$$\Rightarrow DZ(L) = \operatorname{Diag}\big[\operatorname{vec} Y\big]$$
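Again a small numerical check of this step (arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 3
Y = np.eye(k)[rng.integers(0, k, n)]       # one-hot rows
dL = rng.normal(size=(n, k))

vec = lambda A: A.reshape(-1, order="F")   # column-stacking vec

lhs = vec(Y * dL)
rhs = np.diag(vec(Y)) @ vec(dL)            # Diag[vec Y] d vec L
assert np.allclose(lhs, rhs)
```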

Third,

$$d\operatorname{vec} P = \operatorname{vec} d(X\beta)$$
$$= \operatorname{vec}\big(X\,d\beta\big)$$
$$= (I_k \otimes X)\, d\operatorname{vec}\beta$$
$$\Rightarrow DP(\beta) = I_k \otimes X$$
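And a check of the Kronecker step; this identity holds for the column-stacking vec, so NumPy's row-major reshape needs `order='F'` (sizes and data again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 5, 4, 3
X = rng.normal(size=(n, m))
dbeta = rng.normal(size=(m, k))

vec = lambda A: A.reshape(-1, order="F")   # column-stacking vec

lhs = vec(X @ dbeta)
rhs = np.kron(np.eye(k), X) @ vec(dbeta)   # (I_k kron X) d vec(beta)
assert np.allclose(lhs, rhs)
```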
Now let's assume I have calculated both $DL(S)$ and $DS(P)$; each of these should be an $nk \times nk$ matrix. If I combine all of this, I get

$$D\text{cross} = -\big(\operatorname{vec} J\big)^T\operatorname{Diag}\big[\operatorname{vec} Y\big]\big(DL(S)\big)\big(DS(P)\big)(I_k \otimes X) $$

and I tentatively conclude that

$$ \frac{\partial \text{cross}}{\partial \beta}= -\big(\operatorname{vec} J\big)^T\operatorname{Diag}\big[\operatorname{vec} Y\big]\big(DL(S)\big)\big(DS(P)\big)(I_k \otimes X) $$

This has dimension $1\times mk$, which is almost what I want. Is this correct? If I use PyTorch to calculate $DL(S)$ and $DS(P)$, is this derivation of the gradient right?
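For reference, here is a finite-difference sketch for testing the chain above numerically, building $DL(S)$ and $DS(P)$ by hand instead of with PyTorch (all sizes, the seed, and the tolerance are arbitrary choices; the column-stacking vec is used throughout):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 5, 4, 3
X = rng.normal(size=(n, m))
beta = rng.normal(size=(m, k))
Y = np.eye(k)[rng.integers(0, k, n)]

vec = lambda A: A.reshape(-1, order="F")   # column-stacking vec

def loss(beta):
    P = X @ beta
    S = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)
    return -(Y * np.log(S)).sum()

P = X @ beta
S = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)

DE = -np.ones((1, n * k))                  # -(vec J)^T
DZ = np.diag(vec(Y))                       # Diag[vec Y]
DL = np.diag(1.0 / vec(S))                 # Jacobian of element-wise log at S
DS = np.zeros((n * k, n * k))              # row-wise softmax Jacobian (block per row)
for i in range(n):
    for a in range(k):
        for b in range(k):
            DS[i + a * n, i + b * n] = S[i, a] * ((a == b) - S[i, b])
DP = np.kron(np.eye(k), X)                 # I_k kron X

Dcross = DE @ DZ @ DL @ DS @ DP            # 1 x mk row vector

# central-difference gradient of the scalar loss w.r.t. vec(beta)
eps, num = 1e-6, np.zeros(m * k)
for j in range(m * k):
    db = np.zeros((m, k))
    db[np.unravel_index(j, (m, k), order="F")] = eps
    num[j] = (loss(beta + db) - loss(beta - db)) / (2 * eps)

assert np.allclose(Dcross.ravel(), num, atol=1e-5)
```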

Best Answer

$ \def\o{{\tt1}} \def\p{\partial} \def\l{\:{\sf cross}} \def\L{{\cal L}} \def\C{C^{-1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\g#1{\color{blue}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\GLR#1{\g{\LR{#1}}} \def\gred#1#2{\frac{\p #1}{\c{\p #2}}} $The Frobenius product (denoted by a colon) is extremely useful in Matrix Calculus
$$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_{\c F} \qquad \big({\rm\c{Frobenius}\;norm}\big) \\ }$$
The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g.
$$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\ }$$
Furthermore, it commutes with the Hadamard product, i.e.
$$\eqalign{ A:\LR{B\odot C} = \LR{A\odot B}:C \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}C_{ij} }$$
For typing convenience, I'll replace $\beta$ with $B$, and since your $E$ function can be replaced by the Frobenius product, let's redefine it to be the element-wise exponential
$$\eqalign{ E &= \exp(P) \qiq \c{dE = E\odot dP} \\ }$$
which can be used along with Hadamard division $(\oslash)$ and the all-ones matrix $J$ to construct the row-wise SoftMax function
$$\eqalign{ S &= E\oslash EJ \quad\qiq S\o = \o \\ \\ dS &= \c{dE}\oslash EJ \;-\; E\odot\c{d\LR{EJ}}\oslash EJ\oslash EJ \\ &= \c{E\odot dP}\oslash EJ \;-\; E\odot\CLR{dE\:J}\oslash EJ\oslash EJ \\ &= S\odot dP \;-\; S\odot\LR{dE\:J}\oslash EJ \\ }$$
The element-wise logarithm and cross-entropy functions can be differentiated
$$\eqalign{ L &= \log(S) \\ dL &= dS\oslash S \\ &= dP - \LR{dE\:J}\oslash EJ \\ \\ \l &= -\trace{J(Y\odot L)} \\ &= -Y:L \\ d\l &= -Y:dL \\ &= Y:\LR{(dE\:J)\oslash EJ} \;-\; Y:dP \\ &= \LR{Y\oslash EJ}:\LR{dE\:J} \;-\; Y:dP \\ &= \GLR{\LR{Y\oslash EJ}J}:{dE} \;-\; Y:dP \\ &= \g{Q}:\c{dE} \;-\; Y:dP \\ &= Q:\CLR{E\odot dP} \;-\; Y:dP \\ &= \LR{E\odot Q}:dP \;-\; Y:dP \\ &= \LR{E\odot Q-Y}:dP \\ }$$
Finally, substituting $P=XB\:$ yields the desired gradient
$$\eqalign{ d\l &= \LR{E\odot Q-Y}:(X\:dB) \\ &= X^T\LR{E\odot Q-Y}:{dB} \\ \grad{\l}{B} &= X^T\LR{E\odot Q-Y} \\ &= X^T\LR{E\odot\LR{\LR{Y\oslash EJ}J}\;-\;Y} \\ }$$
Notice that this gradient has the same dimensions as the matrix $B$, which are the same as the dimensions of the product $X^TY$.
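A finite-difference sanity check of this closed form (sizes, seed, and tolerance are arbitrary, and $Y$ is taken as a general matrix since the one-hot property is not used until the very end):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 5, 4, 3
X = rng.normal(size=(n, m))
B = rng.normal(size=(m, k))
Y = rng.random(size=(n, k))            # need not be one-hot for the general formula

def loss(B):
    P = X @ B
    S = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)
    return -(Y * np.log(S)).sum()

P = X @ B
E = np.exp(P)
EJ = E @ np.ones((k, k))               # every entry of row i equals the row sum of E
Q = (Y / EJ) @ np.ones((k, k))         # Q = (Y / EJ) J
grad = X.T @ (E * Q - Y)               # claimed gradient, shape m x k

# central-difference gradient of the loss w.r.t. B
eps, num = 1e-6, np.zeros((m, k))
for i in range(m):
    for j in range(k):
        dB = np.zeros((m, k)); dB[i, j] = eps
        num[i, j] = (loss(B + dB) - loss(B - dB)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```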

If the one-hot matrix is defined such that $Y\o=\o,\:$ then the gradient expression simplifies to $$\eqalign{ \grad{\l}{B} &= X^T\LR{S-Y} \\ }$$
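And a quick PyTorch comparison of this simplified one-hot form against autograd (shapes and seed are arbitrary):

```python
import torch

torch.manual_seed(0)
n, m, k = 5, 4, 3
X = torch.randn(n, m)
B = torch.randn(m, k, requires_grad=True)
Y = torch.eye(k)[torch.randint(0, k, (n,))]    # one-hot rows, so Y 1 = 1

P = X @ B
S = torch.softmax(P, dim=1)                    # row-wise softmax
loss = -(Y * torch.log(S)).sum()
loss.backward()

# autograd gradient should match X^T (S - Y)
assert torch.allclose(B.grad, X.T @ (S.detach() - Y), atol=1e-5)
```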
