Solved – Understanding the marginal distribution of multivariate normal distribution

Tags: matrix, matrix inverse, normal distribution

I am trying to better understand the multivariate normal distribution.

Here I am referring to the conditional distribution section of the Wikipedia article and also to the fifth page of this tutorial.

I do not quite understand what the inverse of the covariance matrix actually means.

\begin{equation}
\mu=\begin{bmatrix}
\mu_{1} \\
\mu_{2} \\
\end{bmatrix}
\end{equation}

\begin{equation}
\Sigma=\begin{bmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}\\
\end{bmatrix}
\end{equation}

\begin{equation}
\Lambda=\Sigma^{-1}=
\begin{bmatrix}
\Lambda_{11} & \Lambda_{12}\\
\Lambda_{21} & \Lambda_{22}\\
\end{bmatrix}
\end{equation}

The distribution of $x_{1}$ conditional on $x_{2}$ is again multivariate normal, and we can get its mean and covariance using the partitions of the
covariance matrix and of the inverse covariance matrix.
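
For reference, the formulas from the Wikipedia section are

\begin{equation}
\mu_{1|2}=\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}(x_{2}-\mu_{2}),
\qquad
\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\end{equation}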

But my problem is: what does the inverse of the covariance matrix mean? I do not quite understand this inverse operation, and I also cannot see how to derive the formulas for the mean and covariance of the conditional distribution.

I found a similar question here: What does the inverse of covariance matrix say about data? (Intuitively)
But I still do not quite understand why we partition the matrices here, or how the conditional distribution is derived.

Best Answer

So I'll preface by saying I'm not entirely sure whether the issue here is comfort with matrix inversion, or its interpretation for this statistical purpose.

That said, I think a good way to approach the precision matrix is through what we can do with it.

Conditionals

$ \bf z = \left[ \begin{matrix} \bf x \\ \bf y \end{matrix} \right] \sim \mathcal N \left( \left[ \begin{matrix} \bf a \\ \bf b \end{matrix} \right] , \left[ \begin{matrix} \boldsymbol \Lambda_{aa} & \boldsymbol \Lambda_{ab}\\ \boldsymbol \Lambda_{ab}^T & \boldsymbol \Lambda_{bb} \end{matrix} \right]^{-1} \right) $

Then the conditional has a nice form:

$\bf x | \bf y \sim \mathcal N \left( \bf a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\bf y - \bf b) , \ \ \boldsymbol \Lambda_{aa}^{-1} \right) $

Things to observe:

  1. If $\boldsymbol \Lambda_{ab}$ is zero, $\bf x$ is conditionally independent of $\bf y$. This is useful for modelling situations where, for example, elements of $\bf z$ represent readings at different locations, but the value at each location is only linked to its close neighbours and not to locations far away.
  2. If $\bf x$ is univariate, $\boldsymbol \Lambda_{aa}^{-1}$ is very easy to find (1/scalar).
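
Here is a minimal numerical sketch (with arbitrary, made-up dimensions and a random precision matrix) confirming that this precision-based conditional agrees with the more familiar covariance-based form, mean $\bf a + \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}(\bf y - \bf b)$ and covariance $\boldsymbol \Sigma_{aa} - \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\boldsymbol \Sigma_{ba}$:

```python
# A quick check (a sketch, not a derivation): build a random SPD precision
# matrix, partition it, and compare the precision-based conditional with the
# covariance-based one.
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3                                    # dim(x), dim(y) -- arbitrary
M = rng.normal(size=(p + q, p + q))
Lam = M @ M.T + (p + q) * np.eye(p + q)        # a valid (SPD) precision matrix
Sigma = np.linalg.inv(Lam)                     # covariance = inverse precision

a, b = rng.normal(size=p), rng.normal(size=q)  # means of x and y
y = rng.normal(size=q)                         # an observed value of y

Laa, Lab = Lam[:p, :p], Lam[:p, p:]
Saa, Sab, Sbb = Sigma[:p, :p], Sigma[:p, p:], Sigma[p:, p:]

# Precision form: x|y ~ N(a - Laa^{-1} Lab (y - b), Laa^{-1})
mean_prec = a - np.linalg.solve(Laa, Lab @ (y - b))
cov_prec = np.linalg.inv(Laa)

# Covariance form: x|y ~ N(a + Sab Sbb^{-1} (y - b), Saa - Sab Sbb^{-1} Sba)
mean_cov = a + Sab @ np.linalg.solve(Sbb, y - b)
cov_cov = Saa - Sab @ np.linalg.solve(Sbb, Sab.T)

print(np.allclose(mean_prec, mean_cov))   # True
print(np.allclose(cov_prec, cov_cov))     # True
```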

Reverse conditionals

Suppose $\bf A$ is a matrix of constants (often provided by a given covariance matrix), and we are given two random vectors, along with a prior for one and a conditional/likelihood for the other. See Bishop, PRML, p. 93:

$ \bf y \sim \mathcal N(\bf b, \ \ \boldsymbol \Lambda_{bb}^{-1} ) $

$ \bf x | \bf y \sim \mathcal N \left( \bf A \bf y + \bf a, \ \ \boldsymbol \Lambda_{aa}^{-1} \right) $

Then we can obtain the reverse conditional and the other marginal: $ \bf x \sim \mathcal N(\bf A \bf b + \bf a, \ \ \boldsymbol \Lambda_{aa}^{-1} + \bf A \boldsymbol \Lambda_{bb}^{-1} \bf A^T ) $

$ \bf y | \bf x \sim \mathcal N \left( \left( \boldsymbol \Lambda_{bb} + \bf A^T \boldsymbol \Lambda_{aa} \bf A \right)^{-1} \left( \bf A^T \boldsymbol \Lambda_{aa} (\bf x - \bf a ) + \boldsymbol \Lambda_{bb} \bf b \right) , \ \ \left( \boldsymbol \Lambda_{bb} + \bf A^T \boldsymbol \Lambda_{aa} \bf A \right)^{-1} \right) $
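
Again, a minimal numerical sketch (arbitrary made-up dimensions and matrices) can confirm that this precision-form posterior matches what you get by building the joint covariance of $(\bf x, \bf y)$ and conditioning in the usual covariance-based way:

```python
# Check the reverse conditional y|x quoted above against ordinary conditioning
# on the joint covariance of (x, y) implied by
#   y ~ N(b, Lbb^{-1}),   x|y ~ N(A y + a, Laa^{-1}).
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2                                    # dim(x), dim(y) -- arbitrary
A = rng.normal(size=(p, q))
a, b = rng.normal(size=p), rng.normal(size=q)
Ma, Mb = rng.normal(size=(p, p)), rng.normal(size=(q, q))
Laa = Ma @ Ma.T + p * np.eye(p)                # SPD precision of x|y
Lbb = Mb @ Mb.T + q * np.eye(q)                # SPD precision of y
Sa, Sb = np.linalg.inv(Laa), np.linalg.inv(Lbb)
x = rng.normal(size=p)                         # an observed value of x

# Precision-form posterior quoted above
post_cov = np.linalg.inv(Lbb + A.T @ Laa @ A)
post_mean = post_cov @ (A.T @ Laa @ (x - a) + Lbb @ b)

# Joint-covariance route: E[x] = A b + a and Cov(x) = A Sb A^T + Sa are the
# marginal mean and covariance quoted above; Cov(x, y) = A Sb.
marg_mean = A @ b + a
Cxx, Cxy = A @ Sb @ A.T + Sa, A @ Sb
post_mean2 = b + Cxy.T @ np.linalg.solve(Cxx, x - marg_mean)
post_cov2 = Sb - Cxy.T @ np.linalg.solve(Cxx, Cxy)

print(np.allclose(post_mean, post_mean2))      # True
print(np.allclose(post_cov, post_cov2))        # True
```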

To see how we might use this, suppose $\bf y$ is a common flight path used by planes, and $\bf x$ is the route taken by a particular plane, so that $\bf x | \bf y$ is the route taken by a particular plane given which path it is on. Then we would think about $\bf y | \bf x$ if we had seen a particular plane flying and wanted to get a picture of what path the pilot was trying to follow.

If we take a simple case where $\bf A = \bf I$, $\boldsymbol \Lambda_{aa}=a \bf I$ and $\boldsymbol \Lambda_{bb}=b \bf I$, then:

$ \bf y | \bf x \sim \mathcal N \left( \left( b + a \right)^{-1} \left( a \left(\bf{x} - \bf{a} \right) + b \bf{b} \right) , \ \ \left( b + a \right)^{-1} \bf I \right) $

At which point you might notice that if $b \gg a$ then the mean of $\bf y$ stays much closer to its marginal mean, whereas if $a \gg b$ the mean of $\bf y$ gets moved far more in the direction of the difference of $\bf x$ from its mean.

Hence if $\bf x$ is our observed data, and the information about $\bf y$ is our prior, then $b \gg a$ means we have a strong prior: the data would have to be extremely unusual in order to change our beliefs about the distribution of $\bf y$.
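
To make that precision-weighting concrete (a made-up scalar example): take the scalar case above with prior mean $\bf b = 0$, observed difference $\bf x - \bf a = 10$, likelihood precision $a = 1$ and prior precision $b = 9$. Then the posterior mean is $(9+1)^{-1}(1 \cdot 10 + 9 \cdot 0) = 1$, i.e. it barely moves away from the prior mean. With the precisions swapped ($a = 9$, $b = 1$) the same observation gives $(1+9)^{-1}(9 \cdot 10 + 1 \cdot 0) = 9$, close to the data.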

Derivations

I'd suggest checking out the Woodbury matrix identity.
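
Stated here for convenience, in its standard form:

$ \left( \bf A + \bf U \bf C \bf V \right)^{-1} = \bf A^{-1} - \bf A^{-1} \bf U \left( \bf C^{-1} + \bf V \bf A^{-1} \bf U \right)^{-1} \bf V \bf A^{-1} $

Together with the block-matrix inversion (Schur complement) formulas, this is what connects the covariance blocks to the precision blocks in the question's notation, e.g. $\Lambda_{11}^{-1} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.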

https://github.com/pearcemc/gps/blob/master/MVNs.pdf
