Why is the trace operation involved when calculating the gradient and Hessian?

derivatives, hessian-matrix, matrices, matrix-calculus, trace

The following passage is from Section 11.8.3 of Boyd & Vandenberghe's Convex Optimization:

We consider the SDP
$$\begin{array}{ll}
\operatorname{minimize} & c^T x \\
\text{subject to} & \displaystyle\sum_{i=1}^n x_i F_i + G \preceq 0,
\end{array}$$

with variable $x \in \mathbf{R}^n$, and parameters $F_1,\dots, F_n, G \in\mathbf{S}^p$.
The associated centering problem, using the log-determinant barrier function, is
$$\operatorname{minimize}\quad tc^T x - \log \det \left( -\sum_{i=1}^n x_i F_i - G \right).$$
The Newton step $\Delta x_{nt}$ is found from $H\Delta x_{nt} = -g$, where the Hessian and gradient
are given by
$$H_{ij} = \mathbf{tr}(S^{-1}F_iS^{-1}F_j),\quad i, j = 1,\dots, n$$
$$g_i = tc_i + \mathbf{tr}(S^{-1}F_i),\quad i = 1,\dots, n,$$
where $S = -\displaystyle\sum_{i=1}^n x_iF_i - G$.
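
To make the quoted formulas concrete, here is a small numerical sanity check (my own sketch, not from the book). It builds a random instance with symmetric $F_i$ and a $G$ chosen so that $x=0$ is strictly feasible, evaluates $g$ and $H$ from the formulas above, and compares them against finite differences of $f(x) = tc^Tx - \log\det S(x)$; the dimensions $n$, $p$ and the value of $t$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t = 3, 4, 2.0

def sym(A):
    return (A + A.T) / 2

F = [sym(rng.standard_normal((p, p))) for _ in range(n)]
G = -5.0 * np.eye(p)        # chosen so that x = 0 is strictly feasible: S(0) = -G = 5I > 0
c = rng.standard_normal(n)
x0 = np.zeros(n)

def S_of(x):
    return -sum(xi * Fi for xi, Fi in zip(x, F)) - G

def f(x):
    # centering objective  t c^T x - log det S(x)
    sign, logdet = np.linalg.slogdet(S_of(x))
    assert sign > 0, "x must be strictly feasible"
    return t * (c @ x) - logdet

Sinv = np.linalg.inv(S_of(x0))
g = np.array([t * c[i] + np.trace(Sinv @ F[i]) for i in range(n)])
H = np.array([[np.trace(Sinv @ F[i] @ Sinv @ F[j]) for j in range(n)]
              for i in range(n)])

# compare against central finite differences of f
I, eps = np.eye(n), 1e-4
g_fd = np.array([(f(x0 + eps*I[i]) - f(x0 - eps*I[i])) / (2*eps) for i in range(n)])
H_fd = np.array([[(f(x0 + eps*I[i] + eps*I[j]) - f(x0 + eps*I[i] - eps*I[j])
                   - f(x0 - eps*I[i] + eps*I[j]) + f(x0 - eps*I[i] - eps*I[j])) / (4*eps**2)
                  for j in range(n)] for i in range(n)])

print(np.allclose(g, g_fd, atol=1e-6), np.allclose(H, H_fd, atol=1e-5))   # True True
```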


From the Matrix Cookbook, I got

$$\frac{\partial\ln\det(S)}{\partial S}=S^{-1}, \qquad \frac{\partial\operatorname{tr}(AS^{-1}B)}{\partial S}=-(S^{-1}BAS^{-1})^T$$

Since all the matrices are symmetric in our context, taking $A = I$ and $B = F_i$ gives

$$\frac{\partial\operatorname{tr}(S^{-1}F_i)}{\partial S}=-S^{-1}F_iS^{-1}$$
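
As a quick sanity check of the Cookbook identity (my own setup with random matrices, not from the text), perturbing each entry of a generic invertible $S$ should reproduce $-(S^{-1}BAS^{-1})^T$ entrywise; with $A=I$ and symmetric $S$ and $F_i$, this reduces to the expression above.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, p))
S = np.eye(p) + 0.1 * rng.standard_normal((p, p))     # generic invertible S

def phi(S):
    return np.trace(A @ np.linalg.inv(S) @ B)

Sinv = np.linalg.inv(S)
formula = -(Sinv @ B @ A @ Sinv).T                     # -(S^{-1} B A S^{-1})^T

# entrywise central differences
eps = 1e-6
fd = np.zeros((p, p))
for k in range(p):
    for l in range(p):
        E = np.zeros((p, p))
        E[k, l] = eps
        fd[k, l] = (phi(S + E) - phi(S - E)) / (2 * eps)

print(np.allclose(fd, formula, atol=1e-6))             # True
```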

I know that $g_i$ denotes the derivative of the centering objective (including the log-determinant barrier) with respect to $x_i$. Dimensionally, the trace is needed to produce a scalar, but why don't we use some other scalarization operator? I have encountered many cases in which the trace appears when computing derivatives, yet I haven't figured out why the trace operation is so useful in these contexts. I think many people would like to know the reason. Any instructions will be appreciated.

Best Answer

Consider the gradient and differential of a scalar-valued function $\phi$ of a third-order tensor $\mathcal{X}$:
$$\frac{\partial \phi}{\partial \mathcal{X}_{ijk}} = \mathcal{G}_{ijk} \quad\iff\quad d\phi = \sum_{i,j,k}\mathcal{G}_{ijk}\,d\mathcal{X}_{ijk}$$
Notice that one must sum over every component of the independent variable.
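
To see the contraction concretely, here is a tiny numerical illustration (my own, not part of the original argument) with an arbitrary smooth scalar function of a third-order array:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 3, 4))
dX = 1e-6 * rng.standard_normal(X.shape)

def phi(X):
    return np.sum(X ** 3)            # any smooth scalar-valued function will do

G = 3 * X ** 2                       # componentwise gradient: dphi/dX_ijk

dphi_exact = phi(X + dX) - phi(X)
dphi_contracted = np.einsum('ijk,ijk->', G, dX)   # sum over every index i, j, k

print(dphi_exact, dphi_contracted)   # agree to first order in dX
```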

The situation is exactly the same when the independent variable is a second-order tensor (i.e. a matrix), except there is one less index:
$$\frac{\partial \phi}{\partial X_{ij}} = G_{ij} \quad\iff\quad d\phi = \sum_{i,j}G_{ij}\,dX_{ij}$$
But notice that the expression on the RHS can be written using the trace function:
$$d\phi = \sum_{i,j}G_{ij}\,dX_{ij} \;\doteq\; \operatorname{Tr}\!\left(G^T\,dX\right)$$
This is the reason that the trace appears so often in matrix calculus.
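
The same kind of check in the matrix case makes the trace form visible. As an illustration (again my own, taking $\phi(X)=\log\det X$, whose gradient is $X^{-T}$), the first-order change in $\phi$ matches $\operatorname{Tr}(G^T\,dX)$:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
X = np.eye(m) + 0.1 * rng.standard_normal((m, m))   # well-conditioned, det(X) > 0
dX = 1e-6 * rng.standard_normal((m, m))

G = np.linalg.inv(X).T                              # gradient of log det at X
dphi_exact = np.log(np.linalg.det(X + dX)) - np.log(np.linalg.det(X))
dphi_trace = np.trace(G.T @ dX)                     # = sum_ij G_ij dX_ij

print(dphi_exact, dphi_trace)                       # agree to first order in dX
```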