[Math] Derivation of derivative of multivariate Gaussian w.r.t. covariance matrix

derivatives, matrix-calculus, partial-derivative, statistics

I'm reading a paper, probabilistic CCA, in which the authors state derivatives without showing derivations. I would like step-by-step derivations to convince myself. Consider a $d$-dimensional multivariate Gaussian random variable:

$$
\textbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
$$

In probabilistic CCA, we define $\Sigma = W W^{\top} + \Psi$, where $W \in \mathbb{R}^{d \times q}$ and $\Psi \in \mathbb{R}^{d \times d}$. I'd like to compute the derivative w.r.t. $\boldsymbol{\mu}$, $W$, and $\Psi$ for the negative log-likelihood.
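As a concrete sanity check on this parameterization (not from the paper; the variable names here are my own), $\Sigma = WW^{\top} + \Psi$ is symmetric positive definite whenever $\Psi$ is, since $WW^{\top}$ is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 4, 2

# Low-rank-plus-noise covariance model: Sigma = W W^T + Psi,
# with Psi symmetric positive definite (diagonal here for simplicity)
W = rng.standard_normal((d, q))
Psi = np.diag(rng.uniform(0.5, 1.5, size=d))
Sigma = W @ W.T + Psi

# Sigma is symmetric positive definite, so a Cholesky factorization succeeds
np.linalg.cholesky(Sigma)
print(Sigma.shape)  # (4, 4)
```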

The stationary point for $\boldsymbol{\mu}$ is just the empirical mean $\hat{\boldsymbol{\mu}}$ (derived below*). Plugging this minimizer into the negative log-likelihood and differentiating with respect to $W$, we get:

$$
\frac{\partial \mathcal{L}}{\partial W}
=
\frac{\partial}{\partial W} \Big\{
\overbrace{
\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \hat{\boldsymbol{\mu}})^{\top} \Sigma^{-1} (\textbf{x}_i - \hat{\boldsymbol{\mu}})
}^{A}
+
\overbrace{\frac{n}{2} \ln |\Sigma|}^{B} + \overbrace{\text{const}}^{C}
\Big\}
$$

Clearly, $\partial C / \partial W = 0$. But I'm not sure how to handle $A$ and $B$, particularly since $\Sigma = W W^{\top} + \Psi$.
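I suspect the relevant starting points are the standard matrix-calculus differentials (both are in the Matrix Cookbook), which handle $B$ and $A$ respectively:

```latex
d\ln|\Sigma| = \operatorname{tr}\!\left(\Sigma^{-1}\, d\Sigma\right),
\qquad
d\!\left(\Sigma^{-1}\right) = -\Sigma^{-1}\,(d\Sigma)\,\Sigma^{-1}
```

but I don't see how to chain them through $\Sigma = W W^{\top} + \Psi$.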


*Derivative w.r.t. $\boldsymbol{\mu}$

The negative log-likelihood is:

$$
\mathcal{L}
=
\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) + \frac{n}{2} \ln |\Sigma| + \text{const}
$$

The derivative of the two rightmost terms with respect to $\boldsymbol{\mu}$ is $0$, meaning we just need to compute:

$$
\frac{\partial}{\partial \boldsymbol{\mu}}
\Big\{
\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu})
\Big\}
=
0
$$

By the linearity of differentiation, we have:

$$
\frac{1}{2}
\sum_{i=1}^{n}
\frac{\partial}{\partial \boldsymbol{\mu}}
\Big\{
(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu})
\Big\}
=
0
$$

Using Equation (86) from the Matrix Cookbook, we get:

$$
\frac{1}{2}
\sum_{i=1}^{n}
\Big\{
-2 \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu})
\Big\}
=
0
$$

Finally, solving for $\boldsymbol{\mu}$, we get:

$$
\begin{align}
0
&= \frac{1}{2} \sum_{i=1}^{n} \Big\{ -2 \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) \Big\}
\\
&= - \sum_{i=1}^{n} \Big\{ \Sigma^{-1} \textbf{x}_i - \Sigma^{-1} \boldsymbol{\mu} \Big\}
\\
&= - \sum_{i=1}^{n} \Big\{ \Sigma^{-1} \textbf{x}_i \Big\} + n \Sigma^{-1} \boldsymbol{\mu}
\\
- n \Sigma^{-1} \boldsymbol{\mu} &= - \Sigma^{-1} \sum_{i=1}^{n} \textbf{x}_i
\\
\boldsymbol{\mu} &= \frac{1}{n} \sum_{i=1}^{n} \textbf{x}_i
\end{align}
$$

And we're done.
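This conclusion is easy to verify numerically: at $\boldsymbol{\mu} = \hat{\boldsymbol{\mu}}$ the gradient $-\sum_i \Sigma^{-1}(\textbf{x}_i - \boldsymbol{\mu})$ vanishes for any invertible $\Sigma$. A minimal sketch (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.standard_normal((n, d))              # rows are the samples x_i
Sigma = np.cov(X, rowvar=False) + np.eye(d)  # any SPD matrix works here
Sigma_inv = np.linalg.inv(Sigma)

mu_hat = X.mean(axis=0)  # empirical mean

# Gradient of (1/2) sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu) w.r.t. mu
# is -sum_i Sigma^{-1}(x_i - mu); it vanishes at mu = mu_hat
grad = -Sigma_inv @ (X - mu_hat).sum(axis=0)
print(np.abs(grad).max())  # ~0 up to floating-point error
```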

Best Answer

All those Greek letters are a pain to type, so let's use these variables $$\eqalign{ S = \Sigma,\,\,\,P = \Psi,\,\,\,L={\mathcal L},\,\,\,Z = (X-\mu 1) \cr }$$ where $X$ is the matrix whose columns are the $x_i$ vectors, and $(\mu 1)$ is the matrix each of whose columns equals $\boldsymbol{\mu}$.

Further, let's use a colon to denote the trace/Frobenius product
$$A:B = {\rm tr}(A^TB)$$

Write the objective function in terms of the Frobenius product and these new variables, then find its differential:
$$\eqalign{ L &= \tfrac{n}{2}\log(\det(S)) + \tfrac{1}{2}ZZ^T:S^{-1} + K \cr dL &= \tfrac{n}{2}{\rm tr\,}(d\log(S)) + \tfrac{1}{2}ZZ^T:dS^{-1} + 0 \cr &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big):dS \cr &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big):d(WW^T+P) \cr &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big):(dW\,W^T+ W\,dW^T+dP) \cr }$$

Setting $dW=0$ yields the gradient with respect to $P$:
$$\eqalign{ dL &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big):dP \cr \frac{\partial L}{\partial P} &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big)\cr }$$

while setting $dP=0$ recovers the gradient with respect to $W$:
$$\eqalign{ dL &= \frac{1}{2}\Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big):(dW\,W^T+ W\,dW^T) \cr &= \Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big)W:dW \cr \frac{\partial L}{\partial W} &= \Big(nS^{-1} - S^{-1}ZZ^TS^{-1}\Big)W \cr }$$

In several of the steps, we've made use of the fact that $S$ is symmetric.
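To convince yourself of these closed forms, a quick finite-difference check against the objective $L = \tfrac{n}{2}\log\det S + \tfrac{1}{2}\,{\rm tr}(ZZ^T S^{-1})$ works well. A sketch, assuming NumPy; variable names and sizes are my own:

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, n = 4, 2, 30
X = rng.standard_normal((d, n))        # columns are the samples x_i
Z = X - X.mean(axis=1, keepdims=True)  # Z = X - mu*1, mu the empirical mean

W = rng.standard_normal((d, q))
P = np.diag(rng.uniform(1.0, 2.0, size=d))  # Psi, diagonal for simplicity

def nll(W, P):
    """L = (n/2) log det(S) + (1/2) tr(Z Z^T S^{-1}), with S = W W^T + P."""
    S = W @ W.T + P
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * n * logdet + 0.5 * np.trace(Z @ Z.T @ np.linalg.inv(S))

S_inv = np.linalg.inv(W @ W.T + P)
G = n * S_inv - S_inv @ Z @ Z.T @ S_inv  # shared factor in both gradients
grad_W = G @ W                           # dL/dW from the answer
grad_P = 0.5 * G                         # dL/dP from the answer

# Central finite differences on single entries of W and P
eps = 1e-6
E = np.zeros((d, q)); E[1, 0] = 1.0
fd_W = (nll(W + eps * E, P) - nll(W - eps * E, P)) / (2 * eps)

D = np.zeros((d, d)); D[2, 2] = 1.0
fd_P = (nll(W, P + eps * D) - nll(W, P - eps * D)) / (2 * eps)

print(fd_W, grad_W[1, 0])  # should agree to several digits
print(fd_P, grad_P[2, 2])
```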
