Derivative of matrix-valued function with respect to matrix

derivatives, matrix-calculus

I have seen lots of people asking this question: what is $dF/dW$ when $F = WX$? Here $W$ is an $m \times n$ matrix and $X$ is an $n \times p$ matrix.

The simple answer usually given is $X^{T}$. Where does this come from?

I googled this question, and the Stanford CS231N notes give an explanation. If you derive it properly, the result is supposed to be a higher-order tensor (4 free indices), something like a matrix whose elements are themselves matrices.

In case you are wondering whether I searched this site before asking, and are thinking of closing the question, here are some of my findings from this site and the other resources I came across.

  1. This question attempted to demystify the answer. The answer given there is elaborate, but it says that the result can be realized using the Kronecker product. Isn't that a bit roundabout? What if we want to derive it from the basic rules, i.e. multiply the two matrices and then differentiate each of the $mp$ entries of $F$ with respect to all the matrix elements of $W$?

  2. The resources mentioned in CS231N. Yes, I checked those. I understand the material on matrix derivatives, but I can't find the connection between these two.

What am I missing? How does one derive this kind of expression from the basics?

I want to make sure that I understand this. Thanks.


  1. The CS231N resource I mentioned: Vector, Matrix, and Tensor Derivatives, by Erik Learned-Miller
  2. Another resource from the same CS231N course: Derivatives, Backpropagation, and Vectorization, by Justin Johnson

Best Answer

In index notation, the function can be written as $$F_{ik} = W_{ij} X_{jk}$$ The indices $\{i,k\}$ are not repeated and are called "free" indices,
but $\{j\}$ is a repeated "dummy" index and is implicitly summed over.

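Now calculate the derivative with respect to the component $W_{qr}$, treating the entries of $W$ as independent variables and $X$ as constant $$\eqalign{ \frac{\partial F_{ik}}{\partial W_{qr}} &= \frac{\partial W_{ij}}{\partial W_{qr}}\;X_{jk} \\ &= \delta_{iq}\delta_{rj}\;X_{jk} \\ &= \delta_{iq}\;X_{rk} \\ }$$ The symbol $\delta_{iq}$ is called a Kronecker delta. When $i=q$ it equals ${\tt 1}$, otherwise it equals $0$. The second line uses the fact that $W_{ij}$ depends on $W_{qr}$ only when the two are the same entry, i.e. $i=q$ and $j=r$.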
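As a sanity check on the component formula above, here is a minimal pure-Python sketch that builds every $\partial F_{ik}/\partial W_{qr}$ by perturbing one entry of $W$ at a time and compares it with $\delta_{iq}X_{rk}$. The helper names `matmul` and `derivative_tensor` and the example values of `W` and `X` are made up for illustration; they are not from the CS231N notes.

```python
# Pure-Python sketch (standard library only); matrices are lists of rows.

def matmul(A, B):
    """Plain matrix product: (A B)_ik = sum_j A_ij B_jk."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

def derivative_tensor(W, X):
    """Return D with D[i][k][q][r] = dF_ik / dW_qr for F = W X.

    F is linear in W, so F(W + E_qr) - F(W) is the exact derivative,
    where E_qr has a single 1 in position (q, r).
    """
    m, n, p = len(W), len(X), len(X[0])
    F = matmul(W, X)
    D = [[[[0.0] * n for _ in range(m)] for _ in range(p)] for _ in range(m)]
    for q in range(m):
        for r in range(n):
            W_bumped = [row[:] for row in W]
            W_bumped[q][r] += 1.0
            F_bumped = matmul(W_bumped, X)
            for i in range(m):
                for k in range(p):
                    D[i][k][q][r] = F_bumped[i][k] - F[i][k]
    return D

# Hypothetical example with m=2, n=3, p=2 (whole numbers keep the arithmetic exact).
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X = [[1.0, -2.0], [0.0, 3.0], [5.0, 7.0]]
m, n, p = len(W), len(X), len(X[0])
D = derivative_tensor(W, X)

# Compare every component with delta_iq * X_rk.
matches = all(
    D[i][k][q][r] == (X[r][k] if i == q else 0.0)
    for i in range(m) for k in range(p) for q in range(m) for r in range(n)
)
print(matches)
```

The nested list `D` has one index per free index $(i,k,q,r)$, so for this example its nesting sizes are $2\times 2\times 2\times 3$.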

Since the derivative has 4 free indices $(i,k,q,r)$, it is a 4th order tensor, whose dimensions are $(m\times p\times m\times n)$.
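One way to connect this tensor to the $X^T$ shorthand is to assume, as in the CS231N backpropagation setting, that $F$ feeds into a scalar loss $L$ with upstream gradient $G_{ik} = \partial L/\partial F_{ik}$. The symbols $L$ and $G$ are introduced here only for illustration. Under that assumption the chain rule contracts the tensor over $\{i,k\}$: $$\eqalign{ \frac{\partial L}{\partial W_{qr}} &= G_{ik}\;\frac{\partial F_{ik}}{\partial W_{qr}} \\ &= G_{ik}\;\delta_{iq}\;X_{rk} \\ &= G_{qk}\;X_{rk} = \left(GX^T\right)_{qr} \\ }$$ So the 4th order tensor never has to be stored: contracting with it amounts to right-multiplying the upstream gradient by $X^T$, which is presumably what the short answer "$X^T$" means.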

Since higher order tensors are awkward to work with, most texts flatten the matrices $(F,W)$ into the vectors $(f,w)$ by stacking their columns, and then calculate the derivative using ordinary matrix notation. With $I$ the $m\times m$ identity matrix, $$\eqalign{ {\rm vec}(F) &= {\rm vec}(IWX) = (X^T\otimes I)\,{\rm vec}(W) \\ f &= (X^T\otimes I)\,w \\ df &= (X^T\otimes I)\,dw \\ \frac{\partial f}{\partial w} &= (X^T\otimes I) \\ }$$ This result is an $(mp\times mn)$ matrix, not a tensor; the symbol $\otimes$ represents the Kronecker product. Its entries are the same numbers $\delta_{iq}X_{rk}$ as before, just arranged in two dimensions instead of four.
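The vectorised identity can be checked in the same spirit. The sketch below is pure Python and assumes a column-stacking ${\rm vec}$; the helpers `kron`, `vec`, `matvec` and the example matrices are hypothetical names and values chosen for illustration.

```python
# Pure-Python sketch (standard library only); matrices are lists of rows.

def matmul(A, B):
    """Plain matrix product."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(m):
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

def kron(A, B):
    """Kronecker product: block (a, c) of the result is A[a][c] * B."""
    return [[A[a][c] * B[b][d]
             for c in range(len(A[0])) for d in range(len(B[0]))]
            for a in range(len(A)) for b in range(len(B))]

def vec(A):
    """Column-major vectorisation: stack the columns of A into one list."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Hypothetical example with m=2, n=3, p=2.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X = [[1.0, -2.0], [0.0, 3.0], [5.0, 7.0]]

J = kron(transpose(X), identity(len(W)))  # the matrix X^T (x) I
lhs = vec(matmul(W, X))                   # vec(F)
rhs = matvec(J, vec(W))                   # (X^T (x) I) vec(W)

print(len(J), len(J[0]))  # rows m*p, columns m*n
print(lhs == rhs)
```

Note that the identity depends on the stacking convention: with a row-major ${\rm vec}$ the factors of the Kronecker product would appear in the other order.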