[Math] Matrix calculus – computing the Hessian of a vector-matrix equation.

derivatives, matrix-calculus, multivariable-calculus

Let $\vec{b}=\langle b_1,\dots,b_n\rangle ^T$ be an $n$-dimensional vector of coefficients. Let $\vec{x}_1,\dots,\vec{x}_n$ be $n$ $p$-dimensional vectors. Let $G(\vec{b})=\log\det\left( \sum_{i=1}^n b_i \vec{x}_i\vec{x}_i^T\right)$.

Let $A=\sum_{i=1}^n b_i \vec{x}_i\vec{x}_i^T$. If one wants to compute the $i$-th component of the gradient, one gets

\begin{eqnarray}
\nabla_i G(\vec{b}) &=& \text{Tr}\left( A^{-1}\, \partial_i A \right) \\
&=& \text{Tr}\left( A^{-1} \vec{x}_i\vec{x}_i^T \right) \\
&=& \text{Tr}\left(\vec{x}_i^T A^{-1} \vec{x}_i \right) \\
&=& \vec{x}_i^T A^{-1} \vec{x}_i
\end{eqnarray}
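As a sanity check, this formula agrees with a finite-difference approximation on random data. Here is a minimal numpy sketch of my own (names and dimensions are made up for illustration):

```python
# Check x_i^T A^{-1} x_i against a finite difference of G(b) = log det A.
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
X = rng.standard_normal((n, p))       # rows are the vectors x_1, ..., x_n
b = rng.uniform(1.0, 2.0, size=n)     # positive coefficients keep A positive definite

def G(b):
    A = (X.T * b) @ X                 # A = sum_i b_i x_i x_i^T
    return np.linalg.slogdet(A)[1]    # log det A

A_inv = np.linalg.inv((X.T * b) @ X)
i, eps = 2, 1e-6
analytic = X[i] @ A_inv @ X[i]        # x_i^T A^{-1} x_i
e_i = np.eye(n)[i]
numeric = (G(b + eps * e_i) - G(b - eps * e_i)) / (2 * eps)
print(analytic, numeric)              # the two values agree to high precision
```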

I am filling in the details of this paper (page 19, before equation (33)). So far I agree with their calculation. However, I do not understand lines (33) and (34), in which they compute the Hessian.

They claim that
$$
\nabla^2_{ij} (G(\vec{b})) = -(\vec{x}_i^T A^{-1}\vec{x}_j)^2. \tag1
$$

I get something different. Using the Matrix Cookbook (equation (61)), I see that
\begin{eqnarray}
\partial_j(\vec{x}_i^T A^{-1} \vec{x}_i) &=& -A^{-1}\vec{x}_i\vec{x}_i^T A^{-1}\cdot\partial_j(A) \tag2\\
&=& -A^{-1}\vec{x}_i\vec{x}_i^T A^{-1} \vec{x}_j\vec{x}_j^T,
\end{eqnarray}
which is a matrix and not a scalar!
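To make the mismatch concrete, here is a quick shape check on random data (my own toy setup, nothing from the paper), reusing the $X$, $b$, and `A_inv` from the snippet above:

```python
# The right-hand side of (2) evaluates to a p x p matrix, not a scalar.
i, j = 1, 3
xi, xj = X[i][:, None], X[j][:, None]            # p x 1 column vectors
rhs = -A_inv @ xi @ xi.T @ A_inv @ xj @ xj.T
print(rhs.shape)                                  # (4, 4) -- a matrix, not a scalar
```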

I know I must be making a mistake somewhere. I am still not quite comfortable with matrix calculus.

Can someone help me figure out where I'm going wrong?

Best Answer

The derivation in (2) is wrong. You applied formula (61) from the Matrix Cookbook, but that formula gives the derivative of the quadratic form with respect to the whole matrix, which is why the result is a matrix. Here you need the derivative of $G_i'$ with respect to the single variable $b_j$, so you should use equation (59) from the Cookbook instead. Here is how it goes:
$$
G_{ij}''=[G_i']_j'=\frac{\partial\, x_i^TA^{-1}x_i}{\partial b_j}= x_i^T\frac{\partial A^{-1}}{\partial b_j}x_i\stackrel{\text{Eq.\,(59)}}{=} x_i^T\Bigl(-A^{-1}\underbrace{\frac{\partial A}{\partial b_j}}_{x_jx_j^T}A^{-1}\Bigr)x_i
=-x_i^TA^{-1}x_j\cdot x_j^TA^{-1}x_i,
$$
and since $A$ (hence $A^{-1}$) is symmetric, $x_j^TA^{-1}x_i=x_i^TA^{-1}x_j$, so this is exactly their expression (1).
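As a quick numerical confirmation, here is a minimal sketch on random data (the setup is mine, not the paper's) comparing the formula with a finite difference of $G_i'$:

```python
# Check -(x_i^T A^{-1} x_j)^2 against a finite difference of the i-th
# gradient component with respect to b_j.
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
X = rng.standard_normal((n, p))
b = rng.uniform(1.0, 2.0, size=n)

def grad_i(b, i):
    A_inv = np.linalg.inv((X.T * b) @ X)   # A = sum_k b_k x_k x_k^T
    return X[i] @ A_inv @ X[i]             # x_i^T A^{-1} x_i

i, j, eps = 1, 3, 1e-6
A_inv = np.linalg.inv((X.T * b) @ X)
analytic = -(X[i] @ A_inv @ X[j]) ** 2
e_j = np.eye(n)[j]
numeric = (grad_i(b + eps * e_j, i) - grad_i(b - eps * e_j, i)) / (2 * eps)
print(analytic, numeric)                   # the two values agree to high precision
```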


UPDATE: an explanation of formula (61) in the Matrix Cookbook.

In one-dimensional calculus we have $f\colon\mathbb{R}\to\mathbb{R}$, and the derivative is defined by
$$
f(x+h)-f(x)=\underbrace{f'(x)\cdot h}_{\partial_x f}+o(|h|).\tag{*}
$$

When $x$ becomes a vector and $f\colon\mathbb{R}^n\to\mathbb{R}$, the derivative becomes a vector of partial derivatives (the gradient), which is conventionally arranged to have the same size as $x$, so that in the place of $x_k$ one puts the partial derivative $\frac{\partial f}{\partial x_k}$:
$$
x=\left[\matrix{x_1\\x_2\\\vdots\\x_n}\right]\qquad\Rightarrow\qquad\frac{\partial f}{\partial x}= \left[\matrix{\frac{\partial f}{\partial x_1}\\\frac{\partial f}{\partial x_2}\\\vdots\\\frac{\partial f}{\partial x_n}}\right].
$$
The differential in (*) is then no longer an ordinary multiplication; it becomes the scalar product, i.e. the sum of pointwise products between the gradient and the increment $h$:
$$
f(x+h)-f(x)=\Bigl\langle\frac{\partial f}{\partial x},h\Bigr\rangle+o(\|h\|).
$$

When one has a function $f\colon\mathbb{R}^{m\times n}\to \mathbb{R}$, for example $f(X)=a^TX^{-1}b$ as in (61), it is likewise convenient to arrange the partial derivatives into a matrix, so that in the position of $x_{ij}$ one puts $\frac{\partial f}{\partial x_{ij}}$; this matrix is denoted by $\frac{\partial f}{\partial X}$. However, the differential is not $\partial f=\frac{\partial f}{\partial X}\,\partial X$ as you wrote in (2), but, as in the vector case, the scalar product
$$
\partial f=\Bigl\langle\frac{\partial f}{\partial X},\partial X\Bigr\rangle= \text{tr}\Bigl(\biggl(\frac{\partial f}{\partial X}\biggr)^T\partial X\Bigr).
$$
So the corrected version of (2) would be
$$
\text{tr}(-A^{-1}x_ix_i^TA^{-1}x_jx_j^T)=-(x_j^TA^{-1}x_i)^2.
$$
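To illustrate the trace pairing numerically, here is a small sketch (random data, my own choice of names) checking that for $f(X)=a^TX^{-1}b$ the Cookbook's formula (61), $\frac{\partial f}{\partial X}=-X^{-T}ab^TX^{-T}$, reproduces $f(X+H)-f(X)$ to first order:

```python
# For a small perturbation H, f(X + H) - f(X) should approximately equal
# tr((df/dX)^T H), the matrix scalar product of the derivative with H.
import numpy as np

rng = np.random.default_rng(0)
m = 5
X = rng.standard_normal((m, m)) + 5 * np.eye(m)  # keep X comfortably invertible
a, b = rng.standard_normal(m), rng.standard_normal(m)
H = 1e-6 * rng.standard_normal((m, m))           # small perturbation

f = lambda X: a @ np.linalg.inv(X) @ b
Xinv = np.linalg.inv(X)
dfdX = -Xinv.T @ np.outer(a, b) @ Xinv.T         # Matrix Cookbook, formula (61)
print(f(X + H) - f(X), np.trace(dfdX.T @ H))     # agree to first order in H
```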
