Taylor expansion of a function of a symmetric matrix

linear algebra, matrices, matrix calculus, Taylor expansion

First of all let me tell you that the answer to this question is likely to confirm a not-so-minor error in a very popular (and excellent) textbook on optimization, as you'll see below.

Background

Suppose that we have a real-valued function $f(X)$ whose domain is the set of $n\times n$ nonsingular symmetric matrices. Clearly, $X$ does not have $n^2$ independent variables; it has $n(n+1)/2$ independent variables as it's symmetric. As is well known, an important use of Taylor expansion is to find the derivative of a function by finding the optimal first-order approximation. That is, if one can find a matrix $D \in \mathbb{R}^{n\times n}$ that is a function of $X$ and satisfies

$$f(X+V) = f(X) + \langle D, V \rangle + \text{h.o.t.}, $$
where $\text{h.o.t.}$ stands for higher-order terms and $\langle \cdot, \cdot \rangle$ is an inner product, then the matrix $D$ is the derivative of $f$ w.r.t. $X$.
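For concreteness, here is a small NumPy sketch of this pattern, using the toy function $f(X)=\|X\|_F^2$ (whose derivative is $D=2X$ when all $n^2$ entries are treated as independent); the setup below is purely illustrative.

```python
import numpy as np

# Toy illustration: f(X) = ||X||_F^2 has derivative D = 2X when all
# n^2 entries of X are treated as independent variables.
def f(X):
    return np.sum(X**2)

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n))
D = 2 * X                                 # candidate derivative

for eps in (1e-2, 1e-3, 1e-4):
    V = eps * rng.standard_normal((n, n))
    lhs = f(X + V)
    rhs = f(X) + np.trace(D.T @ V)        # <D, V> taken as the trace inner product
    print(eps, abs(lhs - rhs))            # gap shrinks like eps^2 (the h.o.t.)
```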

Question

Now my question is: What is the right inner product $\langle \cdot, \cdot \rangle$ to use here if the matrix is symmetric? I know that if the entries of $X$ were independent (i.e., $X$ not constrained to be symmetric), then the trace inner product $\langle A, B \rangle = \text{trace}(A^T B)$ would be the correct choice. But I suspect that this is not true in general for a symmetric matrix. More specifically, my guess is that even if the trace inner product leads to a correct expansion in the equation above, the resulting matrix $D$ won't be the correct derivative. Here is why I think this is the case.

A while ago, I asked a question about the derivative of the $\log\det X$ function, because I suspected that the formula in the book Convex Optimization by Boyd & Vandenberghe is wrong. The formula indeed seems to be wrong, as the accepted answer there made clear. I tried to understand what went wrong in the proof in the Convex Optimization book. The approach used in the book is precisely the one I outlined above under Background. The authors show that the first-order Taylor approximation of $f(X)=\log\det X$ for symmetric $X$ is
$$ f(X+V) \approx f(X)+\text{trace}(X^{-1}V). $$

The authors prove this approximation by using a decomposition specific to symmetric matrices (the proof is in Appendix A.4.1; the book is publicly available). Now this approximation is correct, but $X^{-1}$ is not the correct derivative of $\log\det X$ for symmetric $X$; the correct derivative is $2X^{-1}-\text{diag}(\text{diag}(X^{-1}))$. Interestingly, the same approximation in the formula above holds for nonsymmetric invertible matrices too (this can be shown with the SVD), and in that case it does give the right derivative, because the derivative of $\log\det X$ is indeed $X^{-T}$ for a matrix with $n^2$ independent entries. Therefore I suspect that $\text{trace}$ is not the right inner product $\langle \cdot, \cdot \rangle$ for symmetric matrices, as it ignores the fact that the entries of $X$ are not independent. Can anyone shed light on this question?
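To make precise what "derivative for symmetric $X$" means here, one can perturb the $n(n+1)/2$ independent entries one at a time (changing $X_{ij}$ and $X_{ji}$ together) and compare the finite differences with $2X^{-1}-\text{diag}(\text{diag}(X^{-1}))$. A rough NumPy sketch of that check (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)                    # symmetric positive definite

def logdet(M):
    return np.linalg.slogdet(M)[1]

Xinv = np.linalg.inv(X)
claimed = 2 * Xinv - np.diag(np.diag(Xinv))    # 2 X^{-1} - diag(diag(X^{-1}))

eps = 1e-6
numeric = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0                # perturb one independent entry x_ij
        numeric[i, j] = numeric[j, i] = (logdet(X + eps * E) - logdet(X)) / eps

print(np.max(np.abs(numeric - claimed)))       # small: matches the claimed formula
```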

Added: A simpler question

Based on a comment, I understand that the general answer to my question may be difficult, so let me ask a simpler question. The answer to this question may be sufficient to show what went wrong in the proof in the Convex Optimization book.

Suppose $g(X)$ is a function $g: \mathbb{R}^{n\times n} \to \mathbb R$. Is it true that the first-order Taylor approximation with the trace inner product, i.e.,

$$g(X+V) \approx g(X) + \text{trace}\left( \nabla g (X)^T V \right), $$

implicitly assumes that the entries of $X$ are independent? In other words, is it true that this approximation may not hold if entries of $X$ are not independent (e.g., if $X$ is symmetric)?
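As a numerical illustration of what is being asked, one can check the displayed approximation for $g(X)=\log\det X$ with $\nabla g(X)=X^{-T}$, first for a general $X$ and then for a symmetric $X$ with a symmetric perturbation $V$; the sketch below only illustrates the question, it is not an answer to it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

def logdet(M):
    return np.linalg.slogdet(M)[1]

def first_order_gap(X, V):
    G = np.linalg.inv(X).T                    # nabla g(X) = X^{-T}
    return abs(logdet(X + V) - (logdet(X) + np.trace(G.T @ V)))

# General (nonsymmetric) X and V
X = rng.standard_normal((n, n)) + n * np.eye(n)
V = 1e-4 * rng.standard_normal((n, n))
print(first_order_gap(X, V))                  # tiny: the expansion holds

# Symmetric X and symmetric V
A = rng.standard_normal((n, n))
Xs = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n))
Vs = 1e-4 * (B + B.T)
print(first_order_gap(Xs, Vs))                # also tiny
```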

Best Answer

Consider a pair of matrices with elements given by $$\eqalign{ M_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ \frac{1}{2} & \text{otherwise}\end{cases} \\ W_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ 2 & \text{otherwise}\end{cases} \\ }$$ which are Hadamard inverses of each other, i.e. $\;M\odot W={\tt1}$ (the all-ones matrix).
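A two-line NumPy construction of these matrices (illustrative only):

```python
import numpy as np

n = 4
M = np.full((n, n), 0.5) + 0.5 * np.eye(n)    # 1 on the diagonal, 1/2 off-diagonal
W = np.full((n, n), 2.0) - np.eye(n)          # 1 on the diagonal, 2 off-diagonal

print(np.allclose(M * W, np.ones((n, n))))    # Hadamard inverses: M .* W = all-ones
```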

Suppose that you have been given a function, and by hard work you have calculated its gradient $G$ and its Taylor expansion $$f(X+dX) \approx f(X) + G:dX$$ where the colon denotes the Frobenius inner product $\;A:B={\rm Tr}(A^TB)$

Everything looks great until someone points out that your problem has a symmetry constraint
$$X={\rm Sym}(X)\doteq\tfrac{1}{2}\left(X+X^T\right)$$
The constraint implies that both $X$ and $G$ are symmetric, so you might think the constrained gradient is
$$H = {\rm Sym}(G)$$
but this is not correct. Fortunately, there is a way to calculate $H$ from $G$:
$$H = W\odot{\rm Sym}(G) = W\odot G \quad\implies\quad G = M\odot H$$
Substituting this into the Taylor expansion yields
$$\eqalign{ f(X) + G:dX &= f(X) + (M\odot H):dX \\ &= f(X) + H:(M\odot dX) \\ &= f(X) + (\sqrt{M}\odot H):(\sqrt{M}\odot dX) \\ }$$
NB: These matrices are symmetric, with only $\left(\frac{n(n+1)}{2}\right)$ independent components.

You might think of the last expansion formula as the standard inner product after each factor has been projected using the elementwise square root of the $M$ matrix.
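Tying this back to the $\log\det$ example from the question: for symmetric positive definite $X$, the unconstrained Frobenius gradient of $\log\det X$ is $G=X^{-1}$, and $W\odot G$ reproduces the formula $2X^{-1}-\text{diag}(\text{diag}(X^{-1}))$ quoted above. A quick NumPy check of that (assuming the $W$ defined earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)                   # symmetric positive definite

G = np.linalg.inv(X)                          # unconstrained gradient of log det (X^{-T} = X^{-1} here)
W = np.full((n, n), 2.0) - np.eye(n)          # 1 on the diagonal, 2 off-diagonal
H = W * G                                     # constrained gradient H = W .* Sym(G)

claimed = 2 * G - np.diag(np.diag(G))         # 2 X^{-1} - diag(diag(X^{-1}))
print(np.allclose(H, claimed))                # True
```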

The Frobenius $\times$ Hadamard product generates a scalar triple product, i.e. $$A:B\odot C = \sum_i\sum_j A_{ij}B_{ij}C_{ij}$$ The order of the three matrices does not affect the value of this product.
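A quick numerical check of this order invariance (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, C = rng.standard_normal((3, 5, 5))

t1 = np.sum(A * B * C)                        # sum_ij A_ij B_ij C_ij
t2 = np.trace(A.T @ (B * C))                  # A : (B .* C)
t3 = np.trace(B.T @ (A * C))                  # B : (A .* C)
print(np.allclose(t1, t2), np.allclose(t1, t3))
```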

Interestingly, if you had to enforce a skew constraint, i.e. $$X={\rm Skw}(X)\doteq\tfrac{1}{2}\left(X-X^T\right)$$ then the constrained gradient would satisfy your intuition
$$H={\rm Skw}(G)$$ with $\left(\frac{n(n-1)}{2}\right)$ independent components.
