Multivariate Normal Distribution – Deriving Conditional Distributions

conditional probability, normal distribution

We have a multivariate normal vector ${\boldsymbol Y} \sim \mathcal{N}(\boldsymbol\mu, \Sigma)$. Consider partitioning $\boldsymbol\mu$ and ${\boldsymbol Y}$ into
$$
\boldsymbol\mu
=
\begin{bmatrix}
\boldsymbol\mu_1 \\
\boldsymbol\mu_2
\end{bmatrix},
\qquad
{\boldsymbol Y}
=
\begin{bmatrix}
{\boldsymbol y}_1 \\
{\boldsymbol y}_2
\end{bmatrix}
$$

with a similar partition of $\Sigma$ into
$$
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}
$$
Then, $({\boldsymbol y}_1|{\boldsymbol y}_2={\boldsymbol a})$, the conditional distribution of the first partition given the second, is
$\mathcal{N}(\overline{\boldsymbol\mu},\overline{\Sigma})$, with mean
$$
\overline{\boldsymbol\mu}=\boldsymbol\mu_1+\Sigma_{12}{\Sigma_{22}}^{-1}({\boldsymbol a}-\boldsymbol\mu_2)
$$
and covariance matrix
$$
\overline{\Sigma}=\Sigma_{11}-\Sigma_{12}{\Sigma_{22}}^{-1}\Sigma_{21}$$

These results are also given on Wikipedia, but I have no idea how $\overline{\boldsymbol\mu}$ and $\overline{\Sigma}$ are derived. These results are crucial, since they are key statistical formulas for deriving the Kalman filter. Could anyone show me the steps for deriving $\overline{\boldsymbol\mu}$ and $\overline{\Sigma}$? Thank you very much!
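(For concreteness, here is a minimal NumPy sketch of the two formulas; the dimensions and the particular $\boldsymbol\mu$, $\Sigma$, and ${\boldsymbol a}$ below are made up purely for illustration.)

```python
import numpy as np

# Illustrative (made-up) example: dim(y1) = 1, dim(y2) = 2.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition mu and Sigma conformably with Y = (y1, y2).
mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

a = np.array([2.5, 2.0])  # the observed value of y2

# Conditional moments from the formulas above.
mu_bar = mu1 + S12 @ np.linalg.solve(S22, a - mu2)
Sigma_bar = S11 - S12 @ np.linalg.solve(S22, S21)
print(mu_bar)     # conditional mean of y1 given y2 = a
print(Sigma_bar)  # conditional covariance of y1 given y2 = a
```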

Best Answer

You can prove it by explicitly calculating the conditional density by brute force, as in Procrastinator's link (+1) in the comments. But, there's also a theorem that says all conditional distributions of a multivariate normal distribution are normal. Therefore, all that's left is to calculate the mean vector and covariance matrix. I remember we derived this in a time series class in college by cleverly defining a third variable and using its properties to derive the result more simply than the brute force solution in the link (as long as you're comfortable with matrix algebra). I'm going from memory but it was something like this:


Let ${\bf x}_{1}$ be the first partition and ${\bf x}_2$ the second. Now define ${\bf z} = {\bf x}_1 + {\bf A} {\bf x}_2 $ where ${\bf A} = -\Sigma_{12} \Sigma^{-1}_{22}$. Now we can write

\begin{align*} {\rm cov}({\bf z}, {\bf x}_2) &= {\rm cov}( {\bf x}_{1}, {\bf x}_2 ) + {\rm cov}({\bf A}{\bf x}_2, {\bf x}_2) \\ &= \Sigma_{12} + {\bf A} {\rm var}({\bf x}_2) \\ &= \Sigma_{12} - \Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22} \\ &= 0 \end{align*}
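As a quick numerical sanity check of this step, one can compute $\Sigma_{12} + {\bf A}\Sigma_{22}$ for a concrete covariance matrix (a minimal NumPy sketch; the $\Sigma$ below is an arbitrary made-up example):

```python
import numpy as np

# Arbitrary positive-definite covariance, partitioned with dim(x1) = 1, dim(x2) = 2.
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
S12, S22 = Sigma[:1, 1:], Sigma[1:, 1:]

A = -S12 @ np.linalg.inv(S22)   # A = -Sigma_12 Sigma_22^{-1}

# cov(z, x2) = Sigma_12 + A var(x2) = Sigma_12 + A Sigma_22
print(S12 + A @ S22)            # numerically zero (up to rounding), as claimed
```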

Therefore ${\bf z}$ and ${\bf x}_2$ are uncorrelated and, since they are jointly normal (being a linear transformation of the jointly normal vector $({\bf x}_1, {\bf x}_2)$), they are independent. Now, clearly $E({\bf z}) = {\boldsymbol \mu}_1 + {\bf A} {\boldsymbol \mu}_2$, so it follows that

\begin{align*} E({\bf x}_1 | {\bf x}_2) &= E( {\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ & = E({\bf z}|{\bf x}_2) - E({\bf A}{\bf x}_2|{\bf x}_2) \\ & = E({\bf z}) - {\bf A}{\bf x}_2 \\ & = {\boldsymbol \mu}_1 + {\bf A} ({\boldsymbol \mu}_2 - {\bf x}_2) \\ & = {\boldsymbol \mu}_1 + \Sigma_{12} \Sigma^{-1}_{22} ({\bf x}_2- {\boldsymbol \mu}_2) \end{align*}

which proves the first part. For the covariance matrix, note that

\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) &= {\rm var}({\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ &= {\rm var}({\bf z}|{\bf x}_2) + {\bf A}\,{\rm var}({\bf x}_2|{\bf x}_2)\,{\bf A}' - {\bf A}\,{\rm cov}({\bf x}_2, {\bf z}|{\bf x}_2) - {\rm cov}({\bf z}, {\bf x}_2|{\bf x}_2)\,{\bf A}' \\ &= {\rm var}({\bf z}|{\bf x}_2) \\ &= {\rm var}({\bf z}) \end{align*}

where the middle terms vanish because, given ${\bf x}_2$, ${\bf A}{\bf x}_2$ is a constant vector, and the last equality uses the independence of ${\bf z}$ and ${\bf x}_2$.

Now we're almost done:

\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) = {\rm var}( {\bf z} ) &= {\rm var}( {\bf x}_1 + {\bf A} {\bf x}_2 ) \\ &= {\rm var}( {\bf x}_1 ) + {\bf A} {\rm var}( {\bf x}_2 ) {\bf A}' + {\bf A} {\rm cov}({\bf x}_2,{\bf x}_1) + {\rm cov}({\bf x}_1,{\bf x}_2) {\bf A}' \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22}\Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} -\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} \end{align*}

which proves the second part.
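As a sanity check on both results, here is a minimal Monte Carlo sketch (assuming NumPy; the $\boldsymbol\mu$, $\Sigma$, and conditioning value below are arbitrary, and the conditioning is approximated by keeping only draws whose ${\bf x}_2$ lands in a small box around ${\bf a}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative setup: dim(x1) = 1, dim(x2) = 2.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
a = np.array([2.5, 2.0])

# Theoretical conditional moments from the derivation above.
mu_bar = mu1 + S12 @ np.linalg.solve(S22, a - mu2)
Sigma_bar = S11 - S12 @ np.linalg.solve(S22, S21)

# Monte Carlo: keep draws whose x2 falls near a, then look at x1.
X = rng.multivariate_normal(mu, Sigma, size=2_000_000)
keep = np.all(np.abs(X[:, 1:] - a) < 0.05, axis=1)
x1_given_x2 = X[keep, 0]

print(float(mu_bar[0]), x1_given_x2.mean())       # should agree approximately
print(float(Sigma_bar[0, 0]), x1_given_x2.var())  # should agree approximately
```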

Note: For those not very familiar with the matrix algebra used here, the Matrix Cookbook is an excellent resource.

Edit: One property used here that is not in the Matrix Cookbook (good catch @FlyingPig) is property 6 on the Wikipedia page about covariance matrices: for two random vectors $\bf x, y$, $${\rm var}({\bf x}+{\bf y}) = {\rm var}({\bf x})+{\rm var}({\bf y}) + {\rm cov}({\bf x},{\bf y}) + {\rm cov}({\bf y},{\bf x})$$ For scalars, of course, ${\rm cov}(X,Y)={\rm cov}(Y,X)$, but for vectors ${\rm cov}({\bf x},{\bf y})$ and ${\rm cov}({\bf y},{\bf x})$ are transposes of each other, so the two terms cannot in general be merged into $2\,{\rm cov}({\bf x},{\bf y})$.
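For instance, this identity can be verified numerically with sample covariance matrices (a minimal NumPy sketch; the data-generating choices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated random vectors, stored as rows of observations.
n = 100_000
x = rng.normal(size=(n, 2))
y = 0.5 * x + rng.normal(size=(n, 2))

def cross_cov(u, v):
    # Sample cross-covariance matrix cov(u, v), shape (dim u, dim v).
    uc = u - u.mean(axis=0)
    vc = v - v.mean(axis=0)
    return uc.T @ vc / (len(u) - 1)

lhs = np.cov((x + y).T)
rhs = np.cov(x.T) + np.cov(y.T) + cross_cov(x, y) + cross_cov(y, x)
print(np.allclose(lhs, rhs))  # True: the identity holds exactly for sample covariances
```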
