Suppose I have a matrix $A \in \mathbb{R}^{m \times n}$ and I have computed the Gram matrix $G = A^{T} A$. I would like to take the derivative of $G^{-1/2}$ with respect to $A$. I have seen approaches that use an eigenvalue decomposition to obtain the derivative of the inverse square root from the derivative of the inverse of the Gram matrix, but I cannot fully understand them. I would appreciate your help in solving this problem. Thanks a lot in advance.
The derivative of the inverse square root of a Gram matrix
derivatives, matrices
Related Solutions
What follows is an extension of the previous comments, deriving an explicit expression in terms of the Kronecker sum. Applying the differential $\mathrm{d}(\cdot)$ to both sides of $\sqrt{A}\sqrt{A} = A$ yields a special case of the Sylvester equation $$(\mathrm{d}\sqrt{A}) \sqrt{A} \: + \: \sqrt{A} (\mathrm{d}\sqrt{A}) = \mathrm{d}A, $$ which can be solved for the differential matrix $\mathrm{d}\sqrt{A}$ as $$ \text{vec}(\mathrm{d}\sqrt{A}) = \left(\sqrt{A}^{\top} \oplus \sqrt{A}\right)^{-1} \: \text{vec}(\mathrm{d}A). $$ Since $A$ is positive definite, $\sqrt{A}$ is unique and positive definite; the eigenvalues of the Kronecker sum are sums of pairs of eigenvalues of $\sqrt{A}$, so it is positive definite as well (thus non-singular). Further, since the differential and the vec operator can be interchanged on the left-hand side of the equation above, the Jacobian identification rule (p. 198 in Magnus and Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed., chapter 9, section 5) yields $$ \mathrm{D}\sqrt{A} = \left(\sqrt{A}^{\top} \oplus \sqrt{A}\right)^{-1}, $$ where the transpose can be dropped when $A$, in addition to being positive definite, is also symmetric, as the OP asked. Note that for a generic matrix function $F: \mathbb{R}^{p\times q} \mapsto \mathbb{R}^{m\times n}$, the Jacobian is defined as $\mathrm{D}F(X) \triangleq \displaystyle\frac{\partial \: \text{vec}(F(X))}{\partial \: (\text{vec}(X))^{\top}}$, and is of size $mn \times pq$.
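As a sanity check, here is a minimal NumPy sketch (helper names such as `sqrtm_sym` are mine, not from the answer): it builds the Kronecker-sum Jacobian for a symmetric positive-definite $A$ and compares it against a central finite difference along a symmetric perturbation, using column-major vectorization to match the Magnus–Neudecker vec convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # symmetric positive definite

def sqrtm_sym(S):
    # principal square root of a symmetric PD matrix via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

sA = sqrtm_sym(A)
# Kronecker sum: sqrt(A)^T (+) sqrt(A) = sqrt(A)^T (x) I + I (x) sqrt(A)
J = np.linalg.inv(np.kron(sA.T, np.eye(n)) + np.kron(np.eye(n), sA))

# central finite difference along a random symmetric perturbation E
E = rng.standard_normal((n, n)); E = E + E.T
h = 1e-6
lhs = (sqrtm_sym(A + h * E) - sqrtm_sym(A - h * E)) / (2 * h)
rhs = (J @ E.reshape(-1, order='F')).reshape(n, n, order='F')
print(np.max(np.abs(lhs - rhs)))             # small residual: formulas agree
```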
It is generally true that if $A$ is an invertible $n\times n$ matrix and if $A^{-1}$ has a "square root" $C$, also $n\times n$, such that:
$$ A^{-1} = C^2 $$
then $C^{-1} A^{-1} C^{-1} = I$ holds.
The first fact we need is that since $A$ is invertible, $A^{-1}$ is invertible, and this implies $C$ is invertible. For if not, then there would exist a nonzero vector $x$ in the nullspace of $C$, and $Cx=0$ would imply $A^{-1}x=0$, contradicting the invertibility of $A^{-1}$. Thus $A = (C^2)^{-1} = (C^{-1})^2$.
The second fact we need is that a one-sided inverse of a matrix is a two-sided inverse, so that:
$$ A C^2 = I \; \implies \; C A C = I $$
That is, using associativity of matrix multiplication, the left hand side tells us $(AC)C = I$, so that $C$ is a (right) inverse of $AC$. Thus it must also be a (left) inverse of $AC$, which is what the right hand equation states.
Finally, since the inverse of a product is the product of the inverses in reverse order, inverting both sides gives:
$$ C A C = I \; \implies \; C^{-1} A^{-1} C^{-1} = I $$
In this discussion/proof we have not invoked the symmetry of $A$ nor the uniqueness of a symmetric positive definite square root $C$ for $A^{-1}$, which also would be symmetric positive definite. The reasoning above is correct even if $A$ is not symmetric, and even if $C$ is not positive definite, and relies only on $A^{-1} = C^2$.
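For the skeptical reader, a quick numerical illustration of the claim (the construction below is mine: any invertible, not necessarily symmetric, $C$ works, and $A$ is defined through $A^{-1} = C^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
C = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible, not symmetric
A = np.linalg.inv(C @ C)                          # so that A^{-1} = C^2
Cinv = np.linalg.inv(C)
print(np.allclose(C @ A @ C, np.eye(n)))                       # C A C = I
print(np.allclose(Cinv @ np.linalg.inv(A) @ Cinv, np.eye(n)))  # C^-1 A^-1 C^-1 = I
```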
Best Answer
$\def\v{{\rm vec}}\def\M{{\rm Mat}}\def\d{{\rm diag}}\def\D{{\rm Diag}}\def\p#1#2{\frac{\partial #1}{\partial #2}}$Given the matrix $A$, define the symmetric matrices $$\eqalign{ G &= A^TA, \qquad F^2 = G^{-1} \quad\implies\quad F = G^{-1/2} \\ }$$ then calculate their differentials, vectorize, and solve for the desired gradient. $$\eqalign{ &dG = A^TdA + dA^TA \\ &dg = \v(dG) = \Big((I\otimes A^T) + (A^T\otimes I)K\Big)da \\ &F\,dF+dF\,F = -G^{-1}dG\,G^{-1} \\ &\big((I\otimes F)+(F\otimes I)\big)df = -\big(G^{-1}\otimes G^{-1}\big)dg \\ &\big(F\oplus F\big)df = -\big(G^{-1}\otimes G^{-1}\big)dg \\ &df = -\big(F\oplus F\big)^{-1}\big(G^{-1}\otimes G^{-1}\big)dg \\ &df = -\big(F\oplus F\big)^{-1}\big(G^{-1}\otimes G^{-1}\big) \Big((I\otimes A^T) + (A^T\otimes I)K\Big)da \\ &df = B\,da \quad\implies\quad B = \p{f}{a} \\\\ }$$
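To make the chain of identities concrete, here is a hedged NumPy sketch of the whole pipeline (helper names like `commutation` and `sqrtm_sym` are mine): it assembles $B$ exactly as above and checks it column-by-column against central finite differences of $f = {\rm vec}(G^{-1/2})$, with column-major vec throughout.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, n))                   # full column rank w.p. 1

def sqrtm_sym(S):
    # principal square root of a symmetric PD matrix
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

def commutation(p, q):
    # K with vec(X^T) = K vec(X) for X of shape (p, q), column-major vec
    K = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(q):
            K[j + q * i, i + p * j] = 1.0
    return K

G = A.T @ A
Ginv = np.linalg.inv(G)
F = sqrtm_sym(Ginv)                               # F = G^{-1/2}
In = np.eye(n)

# dg = ((I (x) A^T) + (A^T (x) I) K) da
dg_da = np.kron(In, A.T) + np.kron(A.T, In) @ commutation(m, n)
ksum = np.kron(In, F) + np.kron(F, In)            # F (+) F
B = -np.linalg.inv(ksum) @ np.kron(Ginv, Ginv) @ dg_da   # df = B da

def f_of(X):
    return sqrtm_sym(np.linalg.inv(X.T @ X)).reshape(-1, order='F')

h = 1e-6
B_fd = np.zeros((n * n, m * n))
for col in range(m * n):
    dA = np.zeros(m * n); dA[col] = h
    dA = dA.reshape(m, n, order='F')
    B_fd[:, col] = (f_of(A + dA) - f_of(A - dA)) / (2 * h)
print(np.max(np.abs(B - B_fd)))                   # small residual: B is correct
```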
If you're content with a vectorized result then you can stop here.
If you require the full matrix-by-matrix gradient, then read on.
A pair of zero-one third-order tensors $$\eqalign{ {\vec\nu}_{\ell jk} &= \begin{cases} 1\quad{\rm if}\;\;\ell=j+km-m \\ 0\quad{\rm otherwise} \\ \end{cases} \\ {\vec\mu}_{jk\ell} &= \; {\vec\nu}_{\ell jk} \\ {\tt1}\le&j\le m,\quad {\tt1}\le k\le n \\ }$$ can be used to convert a variable between its vector $(\vec\nu)$ and matrix $(\vec\mu)$ forms $$\eqalign{ a &= \vec\nu:A \quad&\iff\quad A=\vec\mu\cdot a \\ }$$ and they allow the above result to be converted from a vector-by-vector (aka matrix) gradient into a matrix-by-matrix (aka fourth-order tensor) gradient $$\eqalign{ df &= B\cdot da \\ \vec\mu\cdot df &= \vec\mu\cdot B\cdot (\vec\nu:dA) \\ dF &= \big(\vec\mu\cdot B\cdot\vec\nu\big):dA \\ \p{F}{A} &= \vec\mu\cdot B\cdot\vec\nu \\ }$$ or in component notation $$\eqalign{ \p{F_{jk}}{A_{pq}} &= \sum_{\varepsilon=1}^{n^2}\sum_{\ell=1}^{mn} {\vec\mu}_{jk\varepsilon} B_{\varepsilon\ell} {\vec\nu}_{\ell pq} \\\\ }$$
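In NumPy, with column-major vec, the whole $\vec\mu\cdot B\cdot\vec\nu$ contraction collapses to a single Fortran-order reshape; a small stand-alone demonstration (the random $B$ below is a stand-in for the gradient computed earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
B = rng.standard_normal((n * n, m * n))   # stand-in for B = df/da above
T = B.reshape(n, n, m, n, order='F')      # T[j,k,p,q] = dF_jk / dA_pq

# the reshape realizes the mu . B . nu contraction entrywise (0-based indices)
j, k, p, q = 1, 2, 3, 0
assert np.isclose(T[j, k, p, q], B[j + n * k, p + m * q])
```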
In the preceding, $\oplus$ denotes the Kronecker sum, $K$ denotes the Commutation Matrix associated with Kronecker products, and a colon denotes the double-dot (aka trace or Frobenius) product $$\eqalign{ A:Z &= \sum_{i=1}^m \sum_{j=1}^n A_{ij} Z_{ij} \;=\; {\rm Tr}(AZ^T) \\ A:A &= \big\|A\big\|^2_F \\ }$$
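And a one-line check of the double-dot identities (the random arrays and shapes are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))
Z = rng.standard_normal((4, 6))
assert np.isclose(np.sum(A * Z), np.trace(A @ Z.T))            # A : Z = Tr(A Z^T)
assert np.isclose(np.sum(A * A), np.linalg.norm(A, 'fro')**2)  # A : A = ||A||_F^2
```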