Gradient of $\operatorname{Tr}( \exp{(H+\log{X})})$ w.r.t $X$.

gradient descent, logarithms, matrices

I ran into this problem while learning about Lieb's inequality.
In $\operatorname{Tr}( \exp{(H+\log{X})})$, $X$ is a square matrix (in the simplest case, diagonal) and $H$ is a Hermitian matrix, although I do not think $H$ has any effect on the gradient.

First, I tried the following calculation.

The general formula for the gradient of the trace of a function $F$ applied to a matrix argument $X$ is [cf. Section 2.5 of The Matrix Cookbook]

$\frac{\partial \operatorname{Tr} (F(X))}{\partial X} =f(X^{T})$,
where $f(\cdot)$ is the scalar derivative of $F(\cdot)$.

I am not sure what "scalar derivative" means. I take it to mean differentiating $F$ as if the matrix argument $X$ were a scalar $x$; that is, in my calculation,

$F(X)= \exp{(\log{X}+H)}$, thus $f(X^{T})= \exp{(\log{X^{T}}+H)}\,(X^{T})^{-1}$.

I am very puzzled by this result. If $X$ is a diagonal matrix and $H$ is a non-diagonal matrix, then the gradient w.r.t. $X$ derived from the above equation is non-diagonal. But $\frac{\partial \operatorname{Tr} (F(X))}{\partial X}$ should be diagonal when $X$ is restricted to be a diagonal matrix.

Next, I tried a second calculation:

$d \operatorname{Tr}( \exp{(\log{X}+H)})= \operatorname{Tr} [d ( \exp{(\log{X}+H)})]$

$= \operatorname{Tr}[\int_{0}^{1} \exp{(\alpha(\log{X}+H))} d(\log{X}+H) \exp{((1-\alpha)(\log{X}+H))} d \alpha] $

$= \operatorname{Tr}[\int_{0}^{1} \exp{(\alpha(\log{X}+H))} \exp{((1-\alpha)(\log{X}+H))} d \alpha d(\log{X}+H)] $

$= \operatorname{Tr} [\exp{((\log{X}+H))} d(\log{X}+H) ] $

$= \operatorname{Tr} [\exp{((\log{X}+H))} ]\operatorname{Tr} [d(\log{X}+H) ] $

$= \operatorname{Tr} [\exp{((\log{X}+H))} ] [d \operatorname{Tr}(\log{X}+H) ] $

$= \operatorname{Tr} [\exp{((\log{X}+H))} ] (X^{T})^{-1} dX$

Can someone help me with the right calculation?

Best Answer

$ \def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Bigg(#1\Bigg)} \def\fracLR#1#2{\L(\frac{#1}{#2}\R)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $Defining the matrix variable and its differential
$$\eqalign{
W &= \LR{X+I}^{-1}\LR{X-I} \\
dW &= \LR{X+I}^{-1}\,dX - \LR{X+I}^{-1}\,dX\,\LR{X+I}^{-1}\LR{X-I} \\
&= \LR{X+I}^{-1}\,dX - \LR{X+I}^{-1}\,dX\;W \\
&= \LR{X+I}^{-1}\,dX\LR{I-W} \\ \\
X &= \LR{I-W}^{-1}\LR{I+W} \\
I &= \LR{I-W}^{-1}\LR{I-W} \\
\LR{X+I} &= 2\LR{I-W}^{-1} \qiq \LR{X+I}^{-1} = \tfrac 12\LR{I-W} \\
}$$
then extending this post to a matrix argument yields formulas for the logarithm and its differential
$$\eqalign{
\log(X) &= \sum_{k=0}^\infty \LR{\frac{2}{2k+1}}W^{2k+1} \\
d\log(X) &= \sum_{k=0}^\infty \LR{\frac{2}{2k+1}} \sum_{j=\o}^{2k+1} W^{j-\o}\,dW\;W^{2k+\o-j} \\
&= \sum_{k=0}^\infty \LR{\frac{2}{2k+1}} \sum_{j=\o}^{2k+1} W^{j-\o}\LR{X+I}^{-1}\,dX\LR{I-W}W^{2k+\o-j} \\
&= \sum_{k=0}^\infty \sum_{j=\o}^{2k+1} \fracLR{W^{j-\o}\LR{I-W}\,dX\LR{I-W}W^{2k+\o-j}}{2k+1} \\
}$$
Now define the matrix variable
$$A=H+\log(X) \qiq dA = d\log(X)$$
and apply the formula from the Cookbook
$$\eqalign{
\phi &= \trace{e^A} \\
d\phi &= \LR{e^A}^T:dA \\
&= \sum_{k=0}^\infty \sum_{j=\o}^{2k+1} \fracLR{ W^{j-\o}\LR{I-W}\,e^A\LR{I-W}W^{2k+\o-j} }{2k+1}^T\!:dX \\
\grad{\phi}{X} &= \sum_{k=0}^\infty \sum_{j=\o}^{2k+1} \fracLR{ W^{j-\o}\LR{I-W}\,e^A\LR{I-W}W^{2k+\o-j} }{2k+1}^T \\
}$$
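As a sanity check, the series for the gradient can be compared numerically against central finite differences. The following is only a minimal NumPy sketch, not part of the derivation: the helper names (`expm_taylor`, `cayley`, `log_series`, `grad_series`, `grad_fd`), the truncation order `K`, the step `eps`, and the test matrices are all illustrative assumptions. The matrix exponential is approximated by a simple scaling-and-squaring Taylor series, and $X$ is chosen with eigenvalues in the right half-plane so that the series in $W$ converges quickly.

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, 2)
    # Pick s so that ||A / 2**s|| <= 1/2, where the Taylor series converges fast.
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B = A / 2.0**s
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ B / k          # term = B^k / k!
        E = E + term
    for _ in range(s):
        E = E @ E                    # undo the scaling by repeated squaring
    return E

def cayley(X):
    """W = (X+I)^{-1}(X-I), the matrix variable defined in the answer."""
    I = np.eye(X.shape[0])
    return np.linalg.solve(X + I, X - I)

def log_series(X, K=30):
    """log(X) = sum_k 2/(2k+1) W^(2k+1), truncated after K terms."""
    W = cayley(X)
    W2 = W @ W
    power = W.copy()                 # holds W^(2k+1)
    L = np.zeros_like(W)
    for k in range(K):
        L += 2.0 / (2 * k + 1) * power
        power = power @ W2
    return L

def phi(X, H):
    """phi = Tr(exp(H + log X))."""
    return np.trace(expm_taylor(H + log_series(X)))

def grad_series(X, H, K=30):
    """Gradient from the double series in the answer, truncated after K outer terms."""
    n = X.shape[0]
    I = np.eye(n)
    W = cayley(X)
    M = (I - W) @ expm_taylor(H + log_series(X, K)) @ (I - W)
    powers = [I]
    for _ in range(2 * K):
        powers.append(powers[-1] @ W)        # powers[p] = W^p
    G = np.zeros((n, n))
    for k in range(K):
        for j in range(1, 2 * k + 2):        # j = 1, ..., 2k+1
            G += (powers[j - 1] @ M @ powers[2 * k + 1 - j]).T / (2 * k + 1)
    return G

def grad_fd(X, H, eps=1e-6):
    """Entrywise central finite differences of phi with respect to X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (phi(X + E, H) - phi(X - E, H)) / (2 * eps)
    return G

# Arbitrary test data: X is non-symmetric with eigenvalues of positive real part,
# H is real symmetric (hence Hermitian) and non-diagonal.
X = np.array([[1.5, 0.2, 0.1], [0.1, 1.0, 0.3], [0.0, 0.2, 0.8]])
H = np.array([[0.3, 0.5, -0.2], [0.5, -0.1, 0.4], [-0.2, 0.4, 0.2]])

G = grad_series(X, H)
G_fd = grad_fd(X, H)
print("max abs difference:", np.max(np.abs(G - G_fd)))
```

The printed difference should be small if the truncated series and the finite-difference estimate agree. By the chain rule, if $X$ is constrained to be diagonal, only the diagonal entries `G[i, i]` are needed as derivatives with respect to the diagonal elements.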


In the above derivation, a colon is used as a convenient product notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a colon product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A = A^TC:B \\ }$$
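The colon-product rules above are easy to spot-check numerically. This is a small NumPy sketch on random rectangular matrices; the helper name `colon`, the matrix shapes, and the extra matrix `D` (a second matrix of the same shape as `A`) are illustrative choices, not part of the derivation.

```python
import numpy as np

def colon(A, B):
    """Frobenius (colon) product A:B = sum_ij A_ij B_ij = Tr(A^T B)."""
    return np.sum(A * B)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))   # same shape as the product A @ B
D = rng.standard_normal((3, 4))   # same shape as A

checks = {
    "A:D = Tr(A^T D)": np.isclose(colon(A, D), np.trace(A.T @ D)),
    "A:A = ||A||_F^2": np.isclose(colon(A, A), np.linalg.norm(A, "fro") ** 2),
    "A:D = D:A": np.isclose(colon(A, D), colon(D, A)),
    "A:D = A^T:D^T": np.isclose(colon(A, D), colon(A.T, D.T)),
    "C:AB = CB^T:A": np.isclose(colon(C, A @ B), colon(C @ B.T, A)),
    "C:AB = A^TC:B": np.isclose(colon(C, A @ B), colon(A.T @ C, B)),
}
for name, ok in checks.items():
    print(name, bool(ok))
```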