Matrix derivative $\frac{\partial}{\partial w} (y^\top g(H(w)) y)$

derivatives, linear-algebra, matrices, matrix-calculus

I'm trying to compute a matrix derivative, but I don't really know how to handle the two vector products. I am not particularly proficient in this kind of calculus, so I have been using the Matrix Cookbook (https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) to help me.

Here is what I have done so far:

Capital letters denote matrices.

We have the following:
\begin{equation}
H = L + W,
\end{equation}

where $L$ is symmetric and $W$ is diagonal, with the vector $w$ on its diagonal. Hence $H$ is also symmetric and has the following property:
\begin{equation}
\frac{\partial H^{-2}}{\partial w} = -2 H^{-3}
\end{equation}

We have:
\begin{equation}
f = y^\top L H^{-2} L y,
\end{equation}

where $y$ is a vector, and I wish to find the derivative of $f$ with respect to $w$.
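For concreteness, the setup can be reproduced numerically. This is a minimal sketch with arbitrary test data; the matrix size, the random seed, and the diagonal shift on $w$ (used only to keep $H$ safely invertible) are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# L symmetric, W = Diag(w), H = L + W
B = rng.standard_normal((n, n))
L = B + B.T
w = rng.standard_normal(n) + 5.0   # shift keeps H comfortably invertible
H = L + np.diag(w)
y = rng.standard_normal(n)

# f = y^T L H^{-2} L y  (a scalar)
Hinv = np.linalg.inv(H)
f = y @ L @ Hinv @ Hinv @ L @ y
print(f)
```

Since $H^{-1}$ is symmetric, $f = \|H^{-1}Ly\|^2$, so the value printed is nonnegative.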

Let
\begin{equation}
g(H) = H^{-2}.
\end{equation}

Then I get:
\begin{align*}
\frac{\partial f}{\partial w} &= y^\top L \frac{\partial g(H)}{\partial w} L y \\
&= y^\top L \text{Tr}(-2H^{-3}) L y
\end{align*}

Which is meaningless?

Here I used the chain rule:
\begin{equation}
\frac{\partial g(H)}{\partial w_{ij}} = \text{Tr}( \frac{\partial g(H)}{\partial H} \frac{\partial H}{\partial w_{ij}})
\end{equation}

I'm not sure what I have done wrong, but I would be grateful if anyone could point out my mistake and guide me in the right direction.

Best Answer

For typing convenience, define the following symmetric matrices $$\eqalign{ A &= -Lyy^TL = A^T \\ V &= H^{-1} = V^T \\ }$$ The main problem with your analysis is that the quantity $\left(\frac{\partial H^{-2}}{\partial w}\right)$ is a third-order tensor, so it cannot possibly be equal to $-2H^{-3}$ as you've assumed.

However, the differential of a matrix is just another matrix, and is much easier to work with than a third-order tensor.

Let's start with the differential of the inverse, and then of its square.
$$\eqalign{ I &= HV \\ 0 &= dH\,V + H\,dV \\ 0 &= V\,dH\,V+dV \\ dV &= -V\,dH\,V \\ \\ V^2 &= V\,V\\ dV^2 &= dV\,V + V\,dV \\ &= -(V\,dH\,V^2+V^2dH\,V) \\ }$$

Next calculate the differential and gradient of the objective function.
$$\eqalign{ f &= y^TLH^{-2}Ly \\&= Lyy^TL:V^2 \\&= -A:V^2 \\ df &= -A:dV^2 \\ &= +A:(V\,dH\,V^2+V^2dH\,V) \\ &= (VAV^2:dH) + (V^2AV:dH) \\ &= V(VA+AV)V:dH \\ }$$

At this point, note that
$$\eqalign{ H &= L + \operatorname{Diag}(w) \\ dH &= \operatorname{Diag}(dw) \\ }$$
and substitute to obtain
$$\eqalign{ df &= V(VA+AV)V:{\rm Diag}(dw) \\ &= {\rm diag}\Big(V(VA+AV)V\Big):dw \\ \frac{\partial f}{\partial w} &= {\rm diag}\Big(V(VA+AV)V\Big) \\ &= -{\,\rm diag}\Big(V(VLyy^TL+Lyy^TLV)V\Big) \\ &= -{\,\rm diag}\Big(H^{-2}Lyy^TLH^{-1}+H^{-1}Lyy^TLH^{-2}\Big) \\ }$$

NB: In the above, a colon is used as a convenient notation for the trace operation, i.e.
$$A:B = {\rm Tr}(A^TB)$$
The cyclic property of the trace allows terms in such a product to be rearranged in a number of ways, e.g.
$$\eqalign{A:BC &= AC^T:B \\&= B^TA:C \\&= BC:A \\&= \ldots}$$
The diag() function extracts the main diagonal of its matrix argument and returns it as a column vector, while the Diag() function takes a vector argument and returns a diagonal matrix.

Update

Since you asked about it, here is how the third-order gradient can be calculated.

Start by introducing a third-order tensor ${\cal F}$ and a fourth-order tensor ${\cal E}$ whose components can be written as
$$\eqalign{ {\cal F}_{ijk} &= \begin{cases} 1 \quad&{\rm if\;} i=j=k \\ 0 \quad&{\rm otherwise} \\ \end{cases} \\ {\cal E}_{ijkl} &= \begin{cases} 1 \quad&{\rm if\;} i=k {\rm\;and\,} j=l \\ 0 \quad&{\rm otherwise} \\ \end{cases} \\ }$$

These tensors are useful because of the following properties:
$$\eqalign{ {\rm Diag}(w) &= {\cal F}\cdot w \\ {\rm diag}(A) &= {\cal F}:A \\ ABC &= \big(A\cdot{\cal E}\cdot C^T\big):B \\ }$$

Applying this to the above differential formula yields
$$\eqalign{ dV^2 &= -(V\,dH\,V^2+V^2dH\,V) \\ &= -(V\cdot{\cal E}\cdot V^2+V^2\cdot{\cal E}\cdot V):dH \\ dH^{-2} &= -(V\cdot{\cal E}\cdot V^2+V^2\cdot{\cal E}\cdot V):{\cal F}\cdot dw \\ \frac{\partial H^{-2}}{\partial w} &= -(V\cdot{\cal E}\cdot V^2+V^2\cdot{\cal E}\cdot V):{\cal F} \\ }$$

where the various dot products with tensors are defined in index notation as
$$\eqalign{ {\cal P} &= {\cal B}:{\cal C} \quad&\implies {\cal P}_{ijmn} &= \sum_k\sum_l{\cal B}_{ijkl}\,{\cal C}_{klmn} \\ {\cal Q} &= {\cal B}\cdot{\cal C} &\implies {\cal Q}_{ijkmnp} &= \sum_l{\cal B}_{ijkl}\,{\cal C}_{lmnp} \\ }$$

Having derived an expression for a typical higher-order tensor gradient, I hope you understand why you will never need it. The only reason anyone asks for it is that they want to use it in a misguided attempt to apply the chain rule.

But instead of the chain rule, one should approach these problems using differentials.

Another workable approach is to use vectorization (aka column-stacking) to reshape every matrix into a (long) column vector.
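As a sketch of that vectorization route, using the identity $\operatorname{vec}(AXB) = (B^\top\!\otimes A)\operatorname{vec}(X)$ together with the fact that every matrix here is symmetric (test data and names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
B = rng.standard_normal((n, n))
L = B + B.T
w = rng.standard_normal(n) + 5.0
y = rng.standard_normal(n)

H = L + np.diag(w)
V = np.linalg.inv(H)
V2 = V @ V
M = L @ np.outer(y, y) @ L            # Lyy^TL

# K represents the linear map dH |-> V dH V^2 + V^2 dH V on vectorized matrices.
# Because V and V2 are symmetric, the same K works for row- or column-stacking.
K = np.kron(V2, V) + np.kron(V, V2)

# S maps w to vec(Diag(w)); S^T extracts the diagonal of a vectorized matrix
S = np.zeros((n * n, n))
for i in range(n):
    S[i * n + i, i] = 1.0

grad_vec = -S.T @ K @ M.ravel()

# compare with the diag() form derived above
grad_diag = -np.diag(V2 @ M @ V + V @ M @ V2)
print(np.max(np.abs(grad_vec - grad_diag)))
```

The price of vectorization is the explicit $n^2 \times n^2$ Kronecker matrix; the differential/diag form computes the same gradient with only $n \times n$ products.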