Calculating the gradient of a particular multivariate normal

derivatives, matrix-calculus

Suppose I have negative twice the log-likelihood of some multivariate normal data $x_1, \ldots, x_n$. I assume the covariance matrix has a low-rank-plus-diagonal structure $LL^\intercal + \Psi$, where $\Psi$ is diagonal and $L$ is lower-triangular with few columns (low rank). The mean vector is $\mu$.

\begin{align}
f(\mu, L, \Psi) &= -2 \log \mathcal{L}(x_1, \ldots, x_n ; \mu, L, \Psi)\\
&= \text{const} + n\log \det(LL^\intercal + \Psi) + \sum_{i=1}^n (x_i - \mu)^\intercal (LL^\intercal + \Psi)^{-1} (x_i - \mu) \\
&= \text{const} + n\log \det(LL^\intercal + \Psi) + \text{trace}\left[ (LL^\intercal + \Psi)^{-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^\intercal\right]
\end{align}

Because the parameter matrices are constrained ($L$ lower-triangular, $\Psi$ diagonal), they live in a lower-dimensional space, so the gradient can be written in terms of the free parameters as

\begin{align*}
\nabla f(\mu, L, \Psi)
&= \begin{bmatrix}
\frac{\partial f}{\partial \mu} \\
\frac{\partial f}{\partial \text{vech} (L)} \\
\frac{\partial f}{\partial \text{diag} (\Psi)}
\end{bmatrix}
\end{align*}

Finding the first block of that vector is straightforward:

\begin{align*}
\frac{\partial f}{\partial \mu}
&= -2n (LL^\intercal + \Psi)^{-1} \bar{x} + 2 n (LL^\intercal + \Psi)^{-1}\mu
\end{align*}
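This expression can be spot-checked against a central finite difference. The dimensions, random data, and the particular $L$ and $\Psi$ below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 4, 2, 50
Lmat = np.tril(rng.normal(size=(d, r)))        # lower-triangular low-rank factor
Psi = np.diag(rng.uniform(0.5, 1.5, size=d))   # diagonal Psi
mu = rng.normal(size=d)
X = rng.normal(size=(d, n))                    # columns are the x_i
P = Lmat @ Lmat.T + Psi
Pinv = np.linalg.inv(P)

def f(m):
    # -2 log-likelihood, up to the additive constant
    Y = m[:, None] - X
    return n * np.linalg.slogdet(P)[1] + np.trace(Pinv @ Y @ Y.T)

xbar = X.mean(axis=1)
grad = -2 * n * Pinv @ xbar + 2 * n * Pinv @ mu   # the formula above

eps = 1e-6
fd = np.array([(f(mu + eps * e) - f(mu - eps * e)) / (2 * eps) for e in np.eye(d)])
assert np.allclose(grad, fd, atol=1e-4)
```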

However, I'm stuck on the other two pieces.

Best Answer

$ \def\bs{\boldsymbol} \def\o{{\tt1}} \def\P{{\Psi}} \def\M{P^{-1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\vecc#1{\op{vec}\LR{#1}} \def\vech#1{\op{vech}\LR{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\Sk{\sum_{k=1}^n} $For typing convenience, define the all-ones vector $\o$ and the variables $$\eqalign{ P &= {\P+LL^T} \\ dP &= d\P+dL\,L^T+L\,dL^T \\ p &= \diag{\P} \qif \P = \Diag{p} \\ h &= \vech{L} = {Ev} \\ v &= \vecc{L} \:= \c{E^Th} \\ }$$ where $E$ is the Elimination Matrix and the last equality in $\c{\rm red}$ is only true if $L$ is lower triangular.
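These conventions can be sketched in numpy; the dimension and the column-major `vec` convention (matching the usual math definition) are my assumptions:

```python
import numpy as np

d = 3
# Rows of E select the on-or-below-diagonal positions of vec(A) (column-major),
# so that vech(A) = E @ vec(A) for any square A.
lower = [i + j * d for j in range(d) for i in range(j, d)]
E = np.eye(d * d)[lower]

L = np.tril(np.arange(1.0, d * d + 1).reshape(d, d))
v = L.flatten(order="F")   # vec(L)
h = E @ v                  # vech(L)

# The equality in red: vec(L) = E^T vech(L) because L is lower triangular...
assert np.allclose(E.T @ h, v)
# ...but it fails for a matrix with nonzero entries above the diagonal.
A = np.arange(1.0, d * d + 1).reshape(d, d)
assert not np.allclose(E.T @ (E @ A.flatten(order="F")), A.flatten(order="F"))
```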

Replace the vector summation by a matrix whose columns are the $x_k$ vectors
$$\eqalign{
X &= {\bs[}\,x_1\;x_2\:\cdots\:x_n\,{\bs]} \\
M &= {\bs[}\;\mu\;\;\mu\;\;\cdots\;\;\mu\,{\bs]} \;= \mu\o^T \\
Y &= {M-X} \qiq dY = d\mu\,\o^T \\
Z &= \M Y \\
YY^T &= \Sk \LR{x_k-\mu}\LR{x_k-\mu}^T \\
}$$
and introduce the Frobenius product, which is a concise notation for the trace
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \|A\|^2_F \\
}$$
The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g.
$$\eqalign{
A:B &= B:A \;=\; \vecc{A}:\vecc{B} \\
A:B &= A^T:B^T \\
C:\LR{AB} &= \LR{CB^T}:A = \LR{A^TC}:B \\
\Diag{a}:B &= a:\diag{B} \\
}$$
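These rearrangement rules are easy to verify numerically; the shapes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
frob = lambda A, B: np.trace(A.T @ B)   # A:B = Tr(A^T B)

A, B = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
assert np.isclose(frob(A, B), (A * B).sum())        # A:B = sum_ij A_ij B_ij
assert np.isclose(frob(A, B), frob(B, A))           # A:B = B:A
assert np.isclose(frob(A, B), frob(A.T, B.T))       # A:B = A^T:B^T

A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(3, 5))
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))  # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))  # C:(AB) = (A^T C):B

a, D = rng.normal(size=3), rng.normal(size=(3, 3))
assert np.isclose(frob(np.diag(a), D), a @ np.diag(D))  # Diag(a):B = a:diag(B)
```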


Write the function using the above notation, then calculate its differential
$$\eqalign{
f &= f_0 + n\log\det(P) + YY^T:\M \\
\\
df &= n\,d\LR{\log\det P} + YY^T:{d\M} + \M:d\LR{YY^T} \\
&= n\LR{P^{-1}:dP} \;+\; YY^T:\LR{-\M\,dP\,\M} \;+\; \M:\LR{dY\,Y^T+Y\,dY^T} \\
&= nP^{-1}:\c{dP} \;-\; ZZ^T:\c{dP} \;+\; 2\M Y:dY \\
&= \LR{nP^{-1}-ZZ^T}:\CLR{d\P+dL\,L^T+L\,dL^T} \;+\; 2\M Y:\LR{d\mu\,\o^T} \\
&= \LR{nP^{-1}-ZZ^T}:d\P \;+\; 2\LR{nP^{-1}-ZZ^T}L:dL \;+\; 2\M Y\o:d\mu \\
&= \diag{nP^{-1}-ZZ^T}:dp \;+\; 2\vecc{nP^{-1}L-ZZ^TL}:dv \;+\; 2\M Y\o:d\mu \\
&= \diag{nP^{-1}-ZZ^T}:dp \;+\; 2\vecc{nP^{-1}L-ZZ^TL}:E^Tdh \;+\; 2\M Y\o:d\mu \\
&= \diag{nP^{-1}-ZZ^T}:dp \;+\; 2E\,\vecc{nP^{-1}L-ZZ^TL}:dh \;+\; 2\M Y\o:d\mu \\
}$$
Now isolate the respective gradients as
$$\eqalign{
\grad{f}{p} &= \diag{nP^{-1}-ZZ^T} \\
\grad{f}{v} &= 2\vecc{nP^{-1}L-ZZ^TL} \\
\grad{f}{h} &= 2E\,\vecc{nP^{-1}L-ZZ^TL} \\
\grad{f}{\mu} &= 2\M Y\o \\
}$$
Note that since $E$ simply selects the on-or-below-diagonal entries of its argument, the $h$-gradient can also be written as $$E\,\vecc{nP^{-1}L-ZZ^TL} = \vech{nP^{-1}L-ZZ^TL}$$ The entries of the (non-symmetric) matrix argument above the diagonal are simply discarded, since they do not correspond to free parameters of the lower-triangular $L$.
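The whole derivation can be checked end-to-end with finite differences. The problem sizes, random data, and the square lower-triangular parameterization of $L$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 30
lower = [i + j * d for j in range(d) for i in range(j, d)]
E = np.eye(d * d)[lower]        # elimination matrix: vech(A) = E @ vec(A)
m = len(lower)                  # number of free entries in L

X = rng.normal(size=(d, n))     # data columns x_k

def unpack(theta):
    mu, h, p = theta[:d], theta[d:d + m], theta[d + m:]
    L = (E.T @ h).reshape(d, d, order="F")   # rebuild lower-triangular L from vech(L)
    return mu, L, p

def f(theta):
    mu, L, p = unpack(theta)
    P = L @ L.T + np.diag(p)
    Y = mu[:, None] - X
    return n * np.linalg.slogdet(P)[1] + np.trace(np.linalg.solve(P, Y @ Y.T))

theta = np.concatenate([rng.normal(size=d),              # mu
                        rng.normal(size=m),              # h = vech(L)
                        rng.uniform(1.0, 2.0, size=d)])  # p = diag(Psi), kept positive

mu, L, p = unpack(theta)
P = L @ L.T + np.diag(p)
Pinv = np.linalg.inv(P)
Y = mu[:, None] - X
Z = Pinv @ Y
G = n * Pinv - Z @ Z.T
g = np.concatenate([2 * Pinv @ Y.sum(axis=1),            # df/dmu = 2 P^{-1} Y 1
                    2 * E @ (G @ L).flatten(order="F"),  # df/dh  = 2 E vec(G L)
                    np.diag(G)])                         # df/dp  = diag(G)

eps = 1e-6
fd = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
               for e in np.eye(len(theta))])
assert np.allclose(g, fd, atol=1e-4)
```

The gradient blocks are stacked in the order $(\mu, \operatorname{vech} L, \operatorname{diag}\Psi)$, matching the question.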