Solved – Derivation of Normal-Wishart posterior

bayesian posterior

I am working on the derivation of a Normal-Wishart posterior, but I'm stuck on one of the parameters (the posterior scale matrix; see the bottom of the post).

Just for context and completeness, here is the model and the rest of the derivations:

\begin{align}
\mathbf{x}_i &\sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})\\
\boldsymbol{\mu} &\sim \mathcal{N}(\boldsymbol{\mu_0}, (\kappa_0 \boldsymbol{\Lambda})^{-1})\\
\boldsymbol{\Lambda} &\sim \mathcal{W}(\upsilon_0, \mathbf{W}_0)
\end{align}

The expanded forms of the three factors (up to a proportionality constant) are:

  • Likelihood:
    \begin{align}
    \prod_{i=1}^N \mathcal{N}(\mathbf{x}_i &| \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})
    \propto\notag\\
    &|\boldsymbol{\Lambda}|^{N/2}
    \exp{\left(-\frac{1}{2}\sum_{i=1}^N \left( \mathbf{x}_i^T\boldsymbol{\Lambda}\mathbf{x}_i - 2 \boldsymbol{\mu}^T \boldsymbol{\Lambda}\mathbf{x}_i + \boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}\right) \right)}
    \end{align}

  • Normal prior:
    \begin{align}
    \mathcal{N}(\boldsymbol{\mu} &| \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1})
    \propto\notag\\
    &|\boldsymbol{\Lambda}|^{1/2}
    \exp{\left(-\frac{1}{2}\left( \boldsymbol{\mu}^T\kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu}
    - 2 \boldsymbol{\mu}^T \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu}_0 +
    \boldsymbol{\mu}_0^T \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu}_0\right) \right)}
    \end{align}

  • Wishart prior:
    \begin{align}
    \mathcal{W}(\boldsymbol{\Lambda} | \upsilon_0, \mathbf{W}_0)
    \propto
    |\boldsymbol{\Lambda}|^{\frac{\upsilon_0-D-1}{2}}
    \exp{\left(-\frac{1}{2} tr(\mathbf{W}_0^{-1} \boldsymbol{\Lambda})\right)}
    \end{align}

We want the posterior Normal-Wishart($\boldsymbol{\mu}, \boldsymbol{\Lambda} | \boldsymbol{\mu}', \kappa', \upsilon', \mathbf{W}'$), which likewise decomposes as $\mathcal{N}(\boldsymbol{\mu} | \boldsymbol{\mu}', (\kappa' \boldsymbol{\Lambda})^{-1})\, \mathcal{W}(\boldsymbol{\Lambda} | \upsilon', \mathbf{W}')$. The four parameters are identified one at a time:

Degrees of freedom $\upsilon'$

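Multiplying the determinant factors of the likelihood and the Wishart prior gives the determinant factor of the posterior Wishart (the $|\boldsymbol{\Lambda}|^{1/2}$ from the Normal prior stays with the posterior Normal factor):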
\begin{align}
|\boldsymbol{\Lambda}|^{\frac{\upsilon_0+N-D-1}{2}}
\end{align}
and therefore we have the first parameter of the posterior:
\begin{align}
\upsilon' = \upsilon_0 +N
\end{align}

Scale factor $\kappa'$

We identify the elements surrounded by $\boldsymbol{\mu}^T$ and $\boldsymbol{\mu}$ to find how the prior precision $\kappa_0 \boldsymbol{\Lambda}$ is updated by the likelihood:
\begin{align}
\boldsymbol{\mu}^T\left((\kappa_0 + N) \boldsymbol{\Lambda}\right)\boldsymbol{\mu}
\end{align}
and therefore we get the second parameter:
\begin{align}
\kappa' = \kappa_0 + N
\end{align}

Mean $\boldsymbol{\mu}'$

The third parameter comes from identifying what multiplies $2\boldsymbol{\mu}^T$ in the cross terms:
\begin{align}
2\boldsymbol{\mu}^T\left(\boldsymbol{\Lambda} N \mathbf{\overline{x}} + \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu}_0\right)
&=
2\boldsymbol{\mu}^T \kappa'\boldsymbol{\Lambda} \boldsymbol{\mu}'\\
\left(\boldsymbol{\Lambda} N \mathbf{\overline{x}} + \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu}_0\right)
&= \kappa'\boldsymbol{\Lambda} \boldsymbol{\mu}'\\
\left(N \mathbf{\overline{x}} + \kappa_0 \boldsymbol{\mu}_0\right)
&= \kappa' \boldsymbol{\mu}'
\end{align}
and therefore we get the third parameter:
\begin{align}
\boldsymbol{\mu}' = \frac{1}{\kappa'}(N\mathbf{\overline{x}} + \kappa_0 \boldsymbol{\mu}_0)
\end{align}
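
The three updates derived so far can be sketched in a few lines of plain Python. The function name `posterior_normal_params` and the list-based vector representation are illustrative assumptions, not part of the derivation itself.

```python
def posterior_normal_params(xs, mu0, kappa0, nu0):
    """Return (nu', kappa', mu') for the Normal-Wishart update.

    xs is a list of N observations, each a list of D floats;
    mu0 is the prior mean (a list of D floats).
    Hypothetical helper, written only to illustrate the formulas above.
    """
    n = len(xs)
    d = len(mu0)
    # Sample mean, component by component.
    xbar = [sum(x[j] for x in xs) / n for j in range(d)]
    nu_n = nu0 + n          # upsilon' = upsilon_0 + N
    kappa_n = kappa0 + n    # kappa'   = kappa_0 + N
    # mu' = (N * xbar + kappa_0 * mu_0) / kappa'
    mu_n = [(n * xbar[j] + kappa0 * mu0[j]) / kappa_n for j in range(d)]
    return nu_n, kappa_n, mu_n


# Example: two 2-D observations, prior mean at the origin, kappa_0 = 2.
nu_n, kappa_n, mu_n = posterior_normal_params(
    [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], 2.0, 3.0
)
```

Because $N = \kappa_0$ in this example, the posterior mean lands halfway between the prior mean and the sample mean, which is a quick way to see that $\kappa_0$ acts like a count of pseudo-observations.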

Scale matrix $\mathbf{W}'$

The fourth parameter comes from the remaining terms, those that do not involve $\boldsymbol{\mu}$:
\begin{align}
tr(\mathbf{W}'^{-1} \boldsymbol{\Lambda})
&= tr(\mathbf{W}_0^{-1} \boldsymbol{\Lambda}) + \sum_{i=1}^{N}\mathbf{x}_i^T\boldsymbol{\Lambda}\mathbf{x}_i +
\boldsymbol{\mu_0}^T \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu_0}\\
&= tr(\mathbf{W}_0^{-1} \boldsymbol{\Lambda}) + \sum_{i=1}^{N}tr(\mathbf{x}_i^T\boldsymbol{\Lambda}\mathbf{x}_i) +
tr(\boldsymbol{\mu_0}^T \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu_0})\\
&= tr\bigg(\mathbf{W}_0^{-1} \boldsymbol{\Lambda} + \sum_{i=1}^{N}\mathbf{x}_i^T\boldsymbol{\Lambda}\mathbf{x}_i +
\boldsymbol{\mu_0}^T \kappa_0 \boldsymbol{\Lambda}\boldsymbol{\mu_0}\bigg)
\end{align}

How do I go on from here (assuming I have made no mistakes so far) to reach the standard solution for $\mathbf{W}'$?

Edit 1:

Now we rearrange the terms, adding and subtracting some terms in order to get two squares as in the standard solution:
\begin{align}
tr(\mathbf{W}'^{-1}\boldsymbol{\Lambda})
=\;&
tr\bigg(\mathbf{W}_0^{-1} \boldsymbol{\Lambda}\\
&+ \sum_{i=1}^N (\mathbf{x}_i^T \boldsymbol{\Lambda} \mathbf{x}_i
+ \overline{\mathbf{x}}^T \boldsymbol{\Lambda} \overline{\mathbf{x}}
-2 \mathbf{x}_i^T \boldsymbol{\Lambda} \mathbf{\overline{x}})\\
&+\kappa_0
(\boldsymbol{\mu}_0^T \boldsymbol{\Lambda} \boldsymbol{\mu}_0
+ \boldsymbol{\overline{x}}^T \boldsymbol{\Lambda}\boldsymbol{\overline{x}}
- 2\boldsymbol{\overline{x}}^T \boldsymbol{\Lambda} \boldsymbol{\mu}_0)
\notag\\% Subtract added terms
&-\sum_{i=1}^{N}\overline{\mathbf{x}}^T \boldsymbol{\Lambda} \overline{\mathbf{x}}
+2\sum_{i=1}^{N}\mathbf{x}_i^T \boldsymbol{\Lambda} \mathbf{\overline{x}}
- \kappa_0\boldsymbol{\overline{x}}^T \boldsymbol{\Lambda} \boldsymbol{\overline{x}}
+ 2\kappa_0\boldsymbol{\overline{x}}^T \boldsymbol{\Lambda}\boldsymbol{\mu}_0\bigg)\\
=\;&
tr\bigg(
\mathbf{W}_0^{-1} \boldsymbol{\Lambda}+
\sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})^T\boldsymbol{\Lambda}(\mathbf{x}_i - \overline{\mathbf{x}})
+ \kappa_0(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)^T\boldsymbol{\Lambda}(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)
\notag\\% Subtract added terms
&-N\overline{\mathbf{x}}^T \boldsymbol{\Lambda} \overline{\mathbf{x}}
+2N\mathbf{\overline{x}}^T \boldsymbol{\Lambda}\mathbf{\overline{x}}
- \kappa_0\boldsymbol{\overline{x}}^T\boldsymbol{\Lambda}\boldsymbol{\overline{x}}
+ 2\kappa_0\boldsymbol{\overline{x}}^T\boldsymbol{\Lambda} \boldsymbol{\mu}_0
\bigg)
\end{align}

We simplify the terms left outside the squares:
\begin{align}
tr(\mathbf{W}'^{-1} \boldsymbol{\Lambda})
=\;&
tr\bigg(
\mathbf{W}_0^{-1} \boldsymbol{\Lambda}+
\sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})^T\boldsymbol{\Lambda}(\mathbf{x}_i - \overline{\mathbf{x}})
+ \kappa_0(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)^T\boldsymbol{\Lambda}(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)
\notag\\% Subtract added terms
&
+(N - \kappa_0)\mathbf{\overline{x}}^T\boldsymbol{\Lambda} \mathbf{\overline{x}}
+ 2\kappa_0\boldsymbol{\overline{x}}^T\boldsymbol{\Lambda}\boldsymbol{\mu}_0
\bigg)
\end{align}

Edit 2 (follow-up thanks to @bdeonovic's answer)

The trace is cyclic, so $tr(ABC) = tr(BCA) = tr(CAB)$. Then:
\begin{align}
tr(\mathbf{W}'^{-1} \boldsymbol{\Lambda})
=\;&
tr\bigg(
\mathbf{W}_0^{-1} \boldsymbol{\Lambda}+
\sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T\boldsymbol{\Lambda}
+ \kappa_0(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)^T\boldsymbol{\Lambda}
\notag\\% Subtract added terms
&
+(N - \kappa_0)\mathbf{\overline{x}} \mathbf{\overline{x}}^T\boldsymbol{\Lambda}
+ 2\kappa_0\boldsymbol{\overline{x}}\boldsymbol{\mu}_0^T \boldsymbol{\Lambda}
\bigg)
\end{align}
and then, dropping the common factor $\boldsymbol{\Lambda}$:
\begin{align}
tr(\mathbf{W}'^{-1})
=\;&
tr\bigg(
\mathbf{W}_0^{-1}+
\sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T
+ \kappa_0(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)^T
\notag\\% Subtract added terms
&
+(N - \kappa_0)\mathbf{\overline{x}} \mathbf{\overline{x}}^T
+ 2\kappa_0\boldsymbol{\overline{x}}\boldsymbol{\mu}_0^T
\bigg)
\end{align}

Almost! But still not there. The goal is:
\begin{align}
\mathbf{W}'^{-1} =
\mathbf{W}_0^{-1} +
\sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T
+ \frac{\kappa_0 N}{\kappa_0 + N}(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)(\boldsymbol{\overline{x}} - \boldsymbol{\mu}_0)^T
\end{align}
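
As a concrete illustration, the target expression can be evaluated directly from data. The sketch below uses plain Python lists; the helper names (`outer`, `mat_add`, `mat_scale`, `posterior_inv_scale`) and the argument `w0_inv` (standing for $\mathbf{W}_0^{-1}$) are illustrative assumptions, not taken from any library.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def mat_add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_scale(c, a):
    """Multiply every entry of matrix a by the scalar c."""
    return [[c * x for x in row] for row in a]


def posterior_inv_scale(xs, mu0, kappa0, w0_inv):
    """Evaluate W0^{-1} + scatter + kappa0*N/(kappa0+N) * (xbar-mu0)(xbar-mu0)^T."""
    n = len(xs)
    d = len(mu0)
    xbar = [sum(x[j] for x in xs) / n for j in range(d)]
    result = [row[:] for row in w0_inv]  # copy so the prior is not mutated
    # Scatter matrix: sum of (x_i - xbar)(x_i - xbar)^T.
    for x in xs:
        dev = [x[j] - xbar[j] for j in range(d)]
        result = mat_add(result, outer(dev, dev))
    # Rank-one term that shrinks as the sample mean approaches the prior mean.
    shift = [xbar[j] - mu0[j] for j in range(d)]
    weight = kappa0 * n / (kappa0 + n)
    return mat_add(result, mat_scale(weight, outer(shift, shift)))


# Example: two 2-D observations, identity prior inverse scale, kappa_0 = 2.
w_inv = posterior_inv_scale(
    [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], 2.0, [[1.0, 0.0], [0.0, 1.0]]
)
```

Any candidate derivation of $\mathbf{W}'^{-1}$ can be compared entry by entry against this function on small inputs, which is often quicker than hunting for a sign error by hand.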

Best Answer

The trace is cyclic, so $tr(ABC) = tr(BCA) = tr(CAB)$. The trace is also linear, so $tr(A+B) = tr(A) + tr(B)$. With these two facts you should be able to cycle the $\Lambda$ factor to the back of each trace term and then combine the terms into a single trace. The result should look something like $$W'^{-1} = W_0^{-1} + \sum_{i=1}^N x_i x_i^\intercal + \kappa_0\mu_0\mu_0^\intercal$$
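
A sketch of how the pieces then combine into the standard form, using the question's symbols ($\kappa' = \kappa_0 + N$ and $\kappa'\boldsymbol{\mu}' = N\overline{\mathbf{x}} + \kappa_0\boldsymbol{\mu}_0$). The one ingredient that does not appear in the question's expression for $tr(\mathbf{W}'^{-1}\boldsymbol{\Lambda})$ is the constant left over when the square in $\boldsymbol{\mu}$ is completed for the posterior Normal factor:
\begin{align}
\kappa'\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}
- 2\boldsymbol{\mu}^T\kappa'\boldsymbol{\Lambda}\boldsymbol{\mu}'
&= \kappa'(\boldsymbol{\mu} - \boldsymbol{\mu}')^T\boldsymbol{\Lambda}(\boldsymbol{\mu} - \boldsymbol{\mu}')
- \kappa'\boldsymbol{\mu}'^T\boldsymbol{\Lambda}\boldsymbol{\mu}'
\end{align}
The last term does not depend on $\boldsymbol{\mu}$, so it belongs to the Wishart factor. Collecting every term outside the posterior Normal and cycling $\boldsymbol{\Lambda}$ to the back gives
\begin{align}
\mathbf{W}'^{-1}
&= \mathbf{W}_0^{-1} + \sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^T
+ \kappa_0\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T
- \kappa'\boldsymbol{\mu}'\boldsymbol{\mu}'^T
\end{align}
Writing $\sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^T = \sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T + N\overline{\mathbf{x}}\,\overline{\mathbf{x}}^T$, the remaining rank-one terms simplify as
\begin{align}
N\overline{\mathbf{x}}\,\overline{\mathbf{x}}^T + \kappa_0\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T
- \frac{1}{\kappa'}(N\overline{\mathbf{x}} + \kappa_0\boldsymbol{\mu}_0)(N\overline{\mathbf{x}} + \kappa_0\boldsymbol{\mu}_0)^T
&= \frac{1}{\kappa'}\Big(\kappa_0 N\,\overline{\mathbf{x}}\,\overline{\mathbf{x}}^T
+ \kappa_0 N\,\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T
- \kappa_0 N\,(\overline{\mathbf{x}}\boldsymbol{\mu}_0^T + \boldsymbol{\mu}_0\overline{\mathbf{x}}^T)\Big)\notag\\
&= \frac{\kappa_0 N}{\kappa_0 + N}(\overline{\mathbf{x}} - \boldsymbol{\mu}_0)(\overline{\mathbf{x}} - \boldsymbol{\mu}_0)^T
\end{align}
so that
\begin{align}
\mathbf{W}'^{-1}
&= \mathbf{W}_0^{-1}
+ \sum_{i=1}^N (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})^T
+ \frac{\kappa_0 N}{\kappa_0 + N}(\overline{\mathbf{x}} - \boldsymbol{\mu}_0)(\overline{\mathbf{x}} - \boldsymbol{\mu}_0)^T
\end{align}
The $(N - \kappa_0)$ and $2\kappa_0$ leftovers in the question's Edit 2 are what remains when the $-\kappa'\boldsymbol{\mu}'\boldsymbol{\mu}'^T$ term is missing; once it is included they cancel.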