[Math] Derivative of a product of three matrices with respect to a matrix

derivatives, matrix-calculus

In short, my problem is to compute $\frac{d(X^tAX)}{dX}$, where both $A$ and $X$ are matrices.

I have to minimize a negative log-likelihood function $L$

$$L = \frac{1}{2}\ln(|\Sigma|)+\frac{1}{2}\varepsilon^t\Sigma^{-1}\varepsilon;$$

where $\Sigma$ is the covariance matrix, $\varepsilon$ is a column vector of residuals (in my case), and $t$ denotes the transpose. The problem is that $\Sigma$ is a function of other matrices:

$$\Sigma = J^tCJ$$

where both $J$ and $C$ are matrices. The matrix $J$ is in turn a function of a vector $\lambda$. I have to minimize the function $L$ w.r.t. the vector $\lambda$. I tried the chain rule to solve this problem as follows:

$$\frac{dL}{d\lambda}=0\implies\frac{dJ}{d\lambda}\frac{d\Sigma}{dJ}\frac{dL}{d\Sigma}=0.$$

In the above equation, $\frac{dJ}{d\lambda}$ and $\frac{d\Sigma}{dJ}$ become tensors, so I am no longer able to write these quantities on paper. Taking the derivative of $L$ w.r.t. an individual element of $\lambda$ does not solve the problem either. There are a few online resources that suggest using the $\operatorname{vec}$ operator to deal with tensors, but they rely heavily on the Kronecker product, which I have not been able to understand well because most of the online material is very opaque.
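(For reference, the identity those resources build on is $\operatorname{vec}(AXB)=(B^T\otimes A)\,\operatorname{vec}(X)$. A minimal NumPy sketch that checks it on random matrices, with arbitrary illustrative sizes:)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

# vec() stacks the columns of a matrix, i.e. column-major ("F") flattening.
def vec(M):
    return M.flatten(order="F")

# Identity: vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))  # True
```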

Can someone please point me toward a solution? If someone can recommend a good text dealing with a similar problem, that would be great.

Best Answer

As you've discovered, the chain rule can be difficult to use with matrix functions. Instead, let's stick with differentials and change the independent variable as necessary until we obtain an expression in terms of $d\lambda$. Then, in the final step, we can recover the gradient.

For convenience, define some new variables (writing $e=\varepsilon$ for the residual vector) $$\eqalign{ S &= S^T = \Sigma =J^TCJ \cr E &= E^T = ee^T \cr M &= M^T = S^{-1} -S^{-1}ES^{-1} \cr G &= \frac{\partial J}{\partial\lambda} \cr }$$ Let's also use a colon to denote the inner/Frobenius product, which is a convenient notation for the matrix trace, $$A:B={\rm tr}(A^TB)$$
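In code, $A:B$ is just the sum of the elementwise product of the two matrices, which is what makes it convenient numerically. A quick NumPy check of the definition (random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

# Frobenius product A:B = tr(A^T B) = sum of elementwise products
print(np.allclose(np.trace(A.T @ B), (A * B).sum()))  # True
```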


Now we can write the objective function in terms of these definitions $$\eqalign{ 2L &= \log\det S + E:S^{-1} \cr \cr 2\,dL &= d\log\det S + E:dS^{-1} \cr &= d{\rm tr}\log S - E:S^{-1}\,dS\,S^{-1} \cr &= (S^{-1} - S^{-1}ES^{-1}):dS \cr &= M:dS \cr &= M:(dJ^T\,CJ + J^TC\,dJ) \cr &= M(CJ)^T:dJ^T + C^TJM:dJ \cr &= (CJM + C^TJM):dJ \cr &= (C+C^T)JM:G\,d\lambda \cr \cr \frac{\partial L}{\partial\lambda} &= \frac{1}{2}(C+C^T)JM:G \cr &= \frac{1}{2}(C+C^T)\,J\,(\Sigma^{-1}-\Sigma^{-1}ee^T\Sigma^{-1}):\frac{\partial J}{\partial\lambda} \cr \cr }$$ The rearrangements use the trace rules listed below, together with the symmetry of $S$, $E$, and $M$. That last expression casts everything back in terms of your original variables.
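If you want to convince yourself of the result, a central finite difference agrees with the formula. Here is a minimal NumPy sketch, assuming a scalar $\lambda$ and a made-up affine parametrization $J(\lambda)=J_0+\lambda J_1$ (so that $G=J_1$); with a vector $\lambda$ you would apply the same formula one component at a time:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 4

# Hypothetical ingredients, chosen only so the gradient is easy to test:
J0 = rng.standard_normal((m, n))
J1 = rng.standard_normal((m, n))          # G = dJ/dlam for J(lam) = J0 + lam*J1
R = rng.standard_normal((m, m))
C = R @ R.T + m * np.eye(m)               # symmetric PD, so S = J^T C J is SPD
e = rng.standard_normal(n)                # residual vector (epsilon)

def L(lam):
    J = J0 + lam * J1
    S = J.T @ C @ J
    _, logdet = np.linalg.slogdet(S)      # stable log det S
    return 0.5 * logdet + 0.5 * e @ np.linalg.solve(S, e)

def dL(lam):
    J = J0 + lam * J1
    S = J.T @ C @ J
    Sinv = np.linalg.inv(S)
    M = Sinv - Sinv @ np.outer(e, e) @ Sinv
    # dL/dlam = (1/2) * ((C + C^T) J M) : G,  with  A:B = sum(A*B)
    return 0.5 * np.sum(((C + C.T) @ J @ M) * J1)

lam, h = 0.3, 1e-6
fd = (L(lam + h) - L(lam - h)) / (2 * h)  # central finite difference
print(np.allclose(fd, dL(lam), rtol=1e-5))  # True
```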

Note that the rules for rearranging the Frobenius product follow from the properties of the trace. Here is a quick list $$\eqalign{ A:BC &= B^TA:C \cr &= AC^T:B \cr &= BC:A \cr &= (BC)^T:A^T \cr }$$
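These, too, can be confirmed numerically in a couple of lines (random square matrices, same illustrative NumPy style as above):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

def frob(X, Y):
    return np.trace(X.T @ Y)              # X:Y = tr(X^T Y)

lhs = frob(A, B @ C)
print(np.allclose(lhs, frob(B.T @ A, C)))      # A:BC = B^T A : C
print(np.allclose(lhs, frob(A @ C.T, B)))      # A:BC = A C^T : B
print(np.allclose(lhs, frob(B @ C, A)))        # A:BC = BC : A
print(np.allclose(lhs, frob((B @ C).T, A.T)))  # A:BC = (BC)^T : A^T
```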
