How to take the partial derivative of a matrix with respect to another matrix

matrices, matrix-calculus, multivariable-calculus

If a matrix $Y$ of dimensions $(m,h)$ is defined as the product $Y = X \cdot W$ of matrices $X$ of dimensions $(m,x)$ and $W$ of dimensions $(x,h)$, how do I obtain the partial derivative of $Y$ with respect to $W$?
I need this derivative as part of the chain rule to eventually calculate the derivative of a scalar $C$ with respect to $W$:

$Y = XW$

$\text{output} = \text{row-wise softmax}(Y)$

$CE = -\text{label} \circ \log(\text{output})$, i.e. element-wise multiplication

$C = \frac{1}{m}\sum_{i,j} CE_{i,j}$, where $m$ is the number of rows in $CE$

Dimensions:

$X$: $m \times x$

$W$: $x \times h$

$Y$, $\text{output}$, $\text{label}$: $m \times h$

$C$: scalar
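For concreteness, the forward computation defined above can be sketched in NumPy. This is only an illustration: the function names `row_softmax` and `cost`, the sizes, the random data, and the use of one-hot label rows are all assumptions, not part of the question.

```python
import numpy as np

def row_softmax(Y):
    # Subtract each row's maximum before exponentiating; this leaves the
    # softmax unchanged but avoids overflow.
    Z = Y - Y.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost(X, W, label):
    Y = X @ W                        # shape (m, h)
    output = row_softmax(Y)          # shape (m, h), each row sums to 1
    CE = -label * np.log(output)     # element-wise product, shape (m, h)
    return CE.sum() / X.shape[0]     # C = (1/m) * sum over all entries of CE

# Hypothetical sizes and data, chosen only for illustration.
rng = np.random.default_rng(0)
m, x, h = 4, 3, 5
X = rng.normal(size=(m, x))
W = rng.normal(size=(x, h))
label = np.eye(h)[rng.integers(0, h, size=m)]   # one-hot rows (an assumption)

C = cost(X, W, label)
print(C)
```

A quick sanity check on such a sketch: with $W=0$ every softmax row is uniform, so $C$ should equal $\log h$.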

Best Answer

Since you evaluate the output function over rows, there's no need to deal with the matrix as a whole. You can write $Y=XW$ in row-wise form as $$y_k^T=x_k^TW \implies y_k=W^Tx_k$$ For now, let's drop the $k$ subscript and work out the gradient for a single row. Then at the end, we'll sum over all the rows.

Let's use a colon to denote the trace/Frobenius product, i.e. $A:B={\rm tr}(A^TB)$, let $b$ be the corresponding row of the label matrix (written as a column vector), and let $1$ be the all-ones vector. Write down the variables and their differentials
$$\eqalign{
y &= W^Tx &\implies dy = dW^Tx \cr
s &= {\rm softmax}(y) \cr
S &= {\rm Diag}(s) &\implies ds = (S-ss^T)dy \cr
\phi &= -b:\log(s) \cr
\beta &= 1:b = 1^Tb \cr
}$$
Here $\beta$ is the sum of the labels in the row, so $\beta=1$ for one-hot labels or any labels that form a probability distribution.

Now find the differential and gradient of the $\phi$ function, using $S^{-1}s=1$.
$$\eqalign{
d\phi &= -b:S^{-1}ds \cr
&= -b:(I-1s^T)dy \cr
&= -(I-s1^T)b:dy \cr
&= (\beta s-b):dy \cr
&= (\beta s-b):dW^Tx \cr
&= (\beta s-b)x^T:dW^T \cr
&= x(\beta s-b)^T:dW \cr
\frac{\partial\phi}{\partial W} &= x(\beta s-b)^T \cr
}$$
That's the gradient for a single row. To find the full gradient, sum over all the rows
$$\eqalign{
\Phi &= \frac{1}{m}\sum_k \phi_k \cr
\frac{\partial\Phi}{\partial W} &= \frac{1}{m}\sum_k \frac{\partial\phi_k}{\partial W} \cr
&= \frac{1}{m}\sum_k x_k(\beta_ks_k-b_k)^T \cr
}$$
When every $\beta_k=1$, stacking the rows gives the matrix form $\frac{1}{m}X^T(\text{output}-\text{label})$.

When working through these expressions, it helps to know that the Frobenius product is commutative.
$$\eqalign{
A:B &= B:A \cr
}$$
It is also useful to know the rules for rearranging a Frobenius product, which follow from the cyclic properties of the trace.
$$\eqalign{
A:BC &= B^TA:C = AC^T:B \cr
}$$

Here is a quick derivation of the gradient of the softmax. The symbols are reused locally: in this derivation $x$ is the input vector, $y$ is its softmax, and $\beta$ is the normalizing sum.
$$\eqalign{
e &= \exp(x) &\implies de = e\circ dx \cr
E &= {\rm Diag}(e) \cr
\beta &= 1:e = 1^Te &\implies d\beta = 1:de \cr
y &= {\rm softmax}(x)=\frac{e}{\beta} \cr
Y &= {\rm Diag}(y) &\implies Y=\frac{E}{\beta} \cr
}$$
$$\eqalign{
dy &= \frac{\beta\,de-e\,d\beta}{\beta^2} \cr
&= \frac{\beta\,e\circ dx-e\,(1:de)}{\beta^2} \cr
&= \frac{\beta\,e\circ dx-e\,(1:e\circ dx)}{\beta^2} \cr
&= \frac{\beta E\,dx-e\,(e:dx)}{\beta^2} \cr
&= \frac{(\beta E-ee^T)dx}{\beta^2} \cr
&= (Y-yy^T)dx \cr
}$$
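A result like this is easy to sanity-check numerically. The sketch below compares central finite differences of the cost against the standard matrix form of the gradient, $\frac{1}{m}X^T(\text{output}-\text{label})$, which holds when each label row sums to 1. The code also carries the row sum of the labels as a factor, so it does not rely on that assumption. It then checks the softmax Jacobian ${\rm Diag}(s)-ss^T$ the same way. All function names, sizes, and random data are hypothetical choices for illustration.

```python
import numpy as np

def row_softmax(Y):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    Z = Y - Y.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost(X, W, label):
    # C = (1/m) * sum_{i,j} (-label * log(softmax(XW)))_{ij}
    S = row_softmax(X @ W)
    return -(label * np.log(S)).sum() / X.shape[0]

def analytic_grad(X, W, label):
    S = row_softmax(X @ W)
    # Row sums of the labels; equal to 1 for one-hot rows, in which case
    # the expression reduces to X^T (S - label) / m.
    row_sum = label.sum(axis=1, keepdims=True)
    return X.T @ (row_sum * S - label) / X.shape[0]

def numeric_grad(X, W, label, eps=1e-6):
    # Central finite differences, one entry of W at a time.
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            G[i, j] = (cost(X, Wp, label) - cost(X, Wm, label)) / (2 * eps)
    return G

def softmax_jacobian(y):
    # Jacobian of softmax at a single vector y: Diag(s) - s s^T.
    s = row_softmax(y[None, :])[0]
    return np.diag(s) - np.outer(s, s)

# Hypothetical sizes and data, chosen only for illustration.
rng = np.random.default_rng(0)
m, x, h = 6, 4, 3
X = rng.normal(size=(m, x))
W = rng.normal(size=(x, h))
label = np.eye(h)[rng.integers(0, h, size=m)]   # one-hot rows (an assumption)

grad_err = np.max(np.abs(analytic_grad(X, W, label) - numeric_grad(X, W, label)))

# Finite-difference Jacobian of softmax: column j is d softmax / d y_j.
y0 = rng.normal(size=h)
eps = 1e-6
J_num = np.column_stack([
    (row_softmax((y0 + eps * e)[None, :])[0]
     - row_softmax((y0 - eps * e)[None, :])[0]) / (2 * eps)
    for e in np.eye(h)
])
jac_err = np.max(np.abs(softmax_jacobian(y0) - J_num))

print("max gradient error:", grad_err)
print("max Jacobian error:", jac_err)
```

Both errors should come out very small, limited only by the finite-difference step and floating-point rounding. The element-by-element loop in `numeric_grad` is slow but deliberately simple, since it is only meant as a check.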
