So I'll preface by saying I'm not entirely sure whether the issue here is comfort with matrix inversion, or with its interpretation for this statistical purpose.
That said, I think a good way to approach the precision matrix is through what we can do with it.
Conditionals
$ \bf z = \left[ \begin{matrix}
\bf x \\
\bf y
\end{matrix} \right]
\sim \mathcal N \left(
\left[ \begin{matrix}
\bf a \\
\bf b
\end{matrix} \right]
,
\left[ \begin{matrix}
\boldsymbol \Lambda_{aa} & \boldsymbol \Lambda_{ab}\\
\boldsymbol \Lambda_{ab}^T & \boldsymbol \Lambda_{bb}
\end{matrix} \right]^{-1}
\right) \\$
Then the conditional has a nice form:
$\bf x | \bf y \sim \mathcal N \left(
\bf a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\bf y - \bf b)
, \ \
\boldsymbol \Lambda_{aa}^{-1}
\right) \\$
Things to observe:
- If $\boldsymbol \Lambda_{ab}$ is zero, $\bf x$ is conditionally independent of $\bf y$. This is useful for modelling situations where, for example, elements of $\bf z$ represent readings at different locations, but the value at each location is only linked to its close neighbours and not to locations far away.
- If $\bf x$ is univariate, $\boldsymbol \Lambda_{aa}^{-1}$ is very easy to find (1/scalar).
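If it helps to see this numerically, here is a minimal sketch (assuming NumPy; the dimensions and the random precision matrix are made up for illustration) checking the precision-form conditional above against the usual covariance-form conditional:

```python
# Check: x | y ~ N(a - Laa^{-1} Lab (y - b), Laa^{-1}) agrees with the
# covariance-form conditional obtained from Sigma = Lambda^{-1}.
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite precision matrix for z = [x, y].
p, q = 2, 3
M = rng.normal(size=(p + q, p + q))
Lam = M @ M.T + (p + q) * np.eye(p + q)          # precision of z
Laa, Lab = Lam[:p, :p], Lam[:p, p:]
a, b = rng.normal(size=p), rng.normal(size=q)
y = rng.normal(size=q)                            # an arbitrary "observed" y

# Conditional from the precision matrix.
mean_prec = a - np.linalg.solve(Laa, Lab @ (y - b))
cov_prec = np.linalg.inv(Laa)

# Conditional from the covariance matrix Sigma = Lam^{-1}.
Sig = np.linalg.inv(Lam)
Sxx, Sxy, Syy = Sig[:p, :p], Sig[:p, p:], Sig[p:, p:]
mean_cov = a + Sxy @ np.linalg.solve(Syy, y - b)
cov_cov = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

print(np.allclose(mean_prec, mean_cov), np.allclose(cov_prec, cov_cov))  # True True
```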
Reverse conditionals
Suppose $\bf A$ is a matrix of constants (often it is supplied by a given covariance structure), and we are given two random vectors, with a prior for one and a conditional/likelihood for the other. See Bishop, PRML, p. 93:
$ \bf y \sim \mathcal N(\bf b, \ \ \boldsymbol \Lambda_{bb}^{-1} ) \\
\bf x | \bf y \sim \mathcal N \left(
\bf A \bf y + \bf a, \ \ \boldsymbol \Lambda_{aa}^{-1}
\right) \\ $
Then we can obtain the reverse conditional and the other marginal:
$ \bf x \sim \mathcal N(\bf A \bf b + \bf a,
\ \ \boldsymbol \Lambda_{aa}^{-1} + \bf A \boldsymbol \Lambda_{bb}^{-1} \bf A^T ) $
$ \bf y | \bf x \sim \mathcal N \left(
\left( \boldsymbol \Lambda_{bb} + \bf A^T \boldsymbol \Lambda_{aa} \bf A \right)^{-1}
\left( \bf A^T \boldsymbol \Lambda_{aa} (\bf x - \bf a ) + \boldsymbol \Lambda_{bb} \bf b \right)
, \ \ \left( \boldsymbol \Lambda_{bb} + \bf A^T \boldsymbol \Lambda_{aa} \bf A \right)^{-1}
\right) $
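As a sanity check on these two formulas, here is a short numerical sketch (assuming NumPy; $\bf A$, the precisions, and the dimensions are arbitrary made-up values) that recovers the same marginal and reverse conditional by building the joint covariance of $(\bf x, \bf y)$ and conditioning in the usual covariance form:

```python
# Under the model above, x = A y + a + noise, so cov(x, y) = A Lbb^{-1}.
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2                                   # dim(x), dim(y)
A = rng.normal(size=(p, q))
a, b = rng.normal(size=p), rng.normal(size=q)

def rand_spd(n):
    # illustrative helper: a random symmetric positive-definite matrix
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

Laa, Lbb = rand_spd(p), rand_spd(q)           # precisions of x|y and y
Saa, Sbb = np.linalg.inv(Laa), np.linalg.inv(Lbb)

# Marginal of x, as stated above.
mean_x, cov_x = A @ b + a, Saa + A @ Sbb @ A.T

# Reverse conditional from the stated formula, at an arbitrary x.
x = rng.normal(size=p)
S = np.linalg.inv(Lbb + A.T @ Laa @ A)
mean_rev = S @ (A.T @ Laa @ (x - a) + Lbb @ b)

# Same conditional via the joint covariance of (x, y).
Cxy = A @ Sbb                                  # cov(x, y)
mean_joint = b + Cxy.T @ np.linalg.solve(cov_x, x - mean_x)
cov_joint = Sbb - Cxy.T @ np.linalg.solve(cov_x, Cxy)

print(np.allclose(mean_rev, mean_joint), np.allclose(S, cov_joint))  # True True
```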
To see how we might use this, suppose $\bf y$ is a common flight path used by planes and $\bf x$ is the route taken by a particular plane, so that $\bf x \mid \bf y$ is the route taken by a particular plane given which path it is on. Then we would look at $\bf y \mid \bf x$ if we had seen a particular plane flying and wanted to build a picture of which path the pilot was trying to follow.
If we take a simple case where $\bf A = I$, $\boldsymbol \Lambda_{aa} = a \bf I$ and $\boldsymbol \Lambda_{bb} = b \bf I$, then:
$ \bf y | \bf x \sim \mathcal N \left(
\left( b + a \right)^{-1}
\left( a \left(\bf{x} - \bf{a} \right) +
b \bf{ b} \right)
, \ \ \left( b + a \right)^{-1} \bf I
\right) $
At this point you might notice that if $b \gg a$ then the mean of $\bf y \mid \bf x$ stays much closer to its marginal mean $\bf b$, whereas if $a \gg b$ the mean of $\bf y \mid \bf x$ is pulled much further in the direction of the deviation of $\bf x$ from its mean.
Hence if $\bf x$ is our observed data and the information about $\bf y$ is our prior, then $b \gg a$ means we have a strong prior: the data would have to be extremely unusual to change our beliefs about the distribution of $\bf y$.
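To make this concrete, here is a tiny numerical sketch of that behaviour (plain Python; the numbers are made up for illustration), using the simplified form above with $\bf A = I$ and scalar precisions $a$ and $b$:

```python
# Posterior mean of y | x in the simplified case, for two prior strengths.
x_obs, a_mean, b_mean = 10.0, 0.0, 0.0   # observation and the two means (illustrative)
for a_prec, b_prec in [(1.0, 100.0), (100.0, 1.0)]:
    post_mean = (a_prec * (x_obs - a_mean) + b_prec * b_mean) / (a_prec + b_prec)
    print(f"a={a_prec:>5}, b={b_prec:>5} -> E[y|x] = {post_mean:.3f}")
# b >> a: the posterior mean stays near the prior mean b (strong prior).
# a >> b: the posterior mean moves toward the observed deviation x - a.
```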
Derivations
I'd suggest checking out the Woodbury matrix identity.
https://github.com/pearcemc/gps/blob/master/MVNs.pdf
Why is the covariance represented with four separate matrices?
I am asking about treating each entry as a matrix: what happens when each entry becomes a matrix, i.e. a block?
In this case the vectors ${\boldsymbol Y}$ and ${\boldsymbol \mu}$ are really block vectors. In the case of an $n$-dimensional ${\boldsymbol Y}$ vector we could expand it as follows:
$$\boldsymbol Y= \begin{bmatrix} \color{blue}{Y_1} \\ \color{red}{Y_2} \end{bmatrix}=\begin{bmatrix}\color{blue}{Y_{11}\\Y_{12}\\\vdots\\ Y_{1h}}\\\color{red}{Y_{21}\\Y_{22}\\\vdots\\ Y_{2k}}\end{bmatrix}\tag{$n \times 1$}$$
showing the partition of the $n$ coordinates into two groups of size $h$ and $k$, respectively, such that $n = h + k$. A parallel illustration would immediately follow for the $\boldsymbol \mu$ vector of population means.
The block matrix of covariances would hence follow as:
$$\begin{bmatrix}
\Sigma_{\color{blue}{11}} & \Sigma_{\color{blue}{1}\color{red}{2}}\\
\Sigma_{\color{red}{2}\color{blue}{1}} & \Sigma_{\color{red}{22}}
\end{bmatrix} \tag {$n \times n$}$$
where
$$\small\Sigma_{\color{blue}{11}}=\begin{bmatrix}
\sigma^2({\color{blue}{Y_{11}}}) & \text{cov}(\color{blue}{Y_{11},Y_{12}}) & \dots & \text{cov}(\color{blue}{Y_{11},Y_{1h}}) \\
\text{cov}(\color{blue}{Y_{12},Y_{11}}) & \sigma^2({\color{blue}{Y_{12}}}) & \dots & \text{cov}(\color{blue}{Y_{12},Y_{1h}}) \\
\vdots & \vdots & & \vdots \\
\text{cov}(\color{blue}{Y_{1h},Y_{11}}) & \text{cov}(\color{blue}{Y_{1h},Y_{12}}) &\dots& \sigma^2({\color{blue}{Y_{1h}}})
\end{bmatrix} \tag{$h \times h$}$$
with
$$\small \Sigma_{\color{blue}{1}\color{red}{2}}=
\begin{bmatrix}
\text{cov}({\color{blue}{Y_{11}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{11}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{11}},\color{red}{Y_{2k}}) \\
\text{cov}({\color{blue}{Y_{12}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{12}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{12}},\color{red}{Y_{2k}}) \\
\vdots & \vdots & & \vdots \\
\text{cov}({\color{blue}{Y_{1h}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{1h}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{1h}},\color{red}{Y_{2k}})
\end{bmatrix}\tag{$h \times k$}
$$
its transpose...
$$\small \Sigma_{\color{red}{2}\color{blue}{1}}=
\begin{bmatrix}
\text{cov}({\color{red}{Y_{21}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{21}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{21}},\color{blue}{Y_{1h}}) \\\text{cov}({\color{red}{Y_{22}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{22}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{22}},\color{blue}{Y_{1h}}) \\
\vdots & \vdots & & \vdots \\
\text{cov}({\color{red}{Y_{2k}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{2k}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{2k}},\color{blue}{Y_{1h}})
\end{bmatrix}\tag{$k \times h$}
$$
and
$$\small \Sigma_{\color{red}{22}}=\begin{bmatrix}
\sigma^2({\color{red}{Y_{21}}}) & \text{cov}(\color{red}{Y_{21},Y_{22}}) & \dots & \text{cov}(\color{red}{Y_{21},Y_{2k}}) \\
\text{cov}(\color{red}{Y_{22},Y_{21}}) & \sigma^2({\color{red}{Y_{22}}}) & \dots & \text{cov}(\color{red}{Y_{22},Y_{2k}}) \\
\vdots & \vdots & & \vdots \\
\text{cov}(\color{red}{Y_{2k},Y_{21}}) & \text{cov}(\color{red}{Y_{2k},Y_{22}}) &\dots& \sigma^2({\color{red}{Y_{2k}}})
\end{bmatrix} \tag{$k \times k$}$$
These partitions come into play in proving that the marginal distributions of a multivariate Gaussian are also Gaussian, as well as in the actual derivation of marginal and conditional pdf's.
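For what it's worth, here is a small sketch of this partitioning in code (assuming NumPy; $h$, $k$ and the covariance matrix are made-up examples), since in practice the four blocks are just slices of the full $n \times n$ matrix:

```python
# Slice an n x n covariance matrix into the four blocks for the first h
# coordinates and the remaining k coordinates.
import numpy as np

rng = np.random.default_rng(2)
h, k = 3, 2
n = h + k
M = rng.normal(size=(n, n))
Sigma = M @ M.T + n * np.eye(n)        # a full n x n covariance matrix

Sigma_11 = Sigma[:h, :h]               # h x h: covariances within the first group
Sigma_12 = Sigma[:h, h:]               # h x k: cross-covariances
Sigma_21 = Sigma[h:, :h]               # k x h: transpose of Sigma_12
Sigma_22 = Sigma[h:, h:]               # k x k: covariances within the second group

print(np.allclose(Sigma_21, Sigma_12.T))  # True, as noted above
```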
Best Answer
You can prove it by explicitly calculating the conditional density by brute force, as in Procrastinator's link (+1) in the comments. But, there's also a theorem that says all conditional distributions of a multivariate normal distribution are normal. Therefore, all that's left is to calculate the mean vector and covariance matrix. I remember we derived this in a time series class in college by cleverly defining a third variable and using its properties to derive the result more simply than the brute force solution in the link (as long as you're comfortable with matrix algebra). I'm going from memory but it was something like this:
Let ${\bf x}_{1}$ be the first partition and ${\bf x}_2$ the second. Now define ${\bf z} = {\bf x}_1 + {\bf A} {\bf x}_2 $ where ${\bf A} = -\Sigma_{12} \Sigma^{-1}_{22}$. Now we can write
\begin{align*} {\rm cov}({\bf z}, {\bf x}_2) &= {\rm cov}( {\bf x}_{1}, {\bf x}_2 ) + {\rm cov}({\bf A}{\bf x}_2, {\bf x}_2) \\ &= \Sigma_{12} + {\bf A} {\rm var}({\bf x}_2) \\ &= \Sigma_{12} - \Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22} \\ &= 0 \end{align*}
Therefore ${\bf z}$ and ${\bf x}_2$ are uncorrelated and, since they are jointly normal, they are independent. Now, clearly $E({\bf z}) = {\boldsymbol \mu}_1 + {\bf A} {\boldsymbol \mu}_2$, therefore it follows that
\begin{align*} E({\bf x}_1 | {\bf x}_2) &= E( {\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ & = E({\bf z}|{\bf x}_2) - E({\bf A}{\bf x}_2|{\bf x}_2) \\ & = E({\bf z}) - {\bf A}{\bf x}_2 \\ & = {\boldsymbol \mu}_1 + {\bf A} ({\boldsymbol \mu}_2 - {\bf x}_2) \\ & = {\boldsymbol \mu}_1 + \Sigma_{12} \Sigma^{-1}_{22} ({\bf x}_2- {\boldsymbol \mu}_2) \end{align*}
which proves the first part. For the covariance matrix, note that
\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) &= {\rm var}({\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\ &= {\rm var}({\bf z}|{\bf x}_2) + {\rm var}({\bf A} {\bf x}_2 | {\bf x}_2) - {\rm cov}({\bf z}, {\bf x}_2|{\bf x}_2){\bf A}' - {\bf A}\,{\rm cov}({\bf x}_2, {\bf z}|{\bf x}_2) \\ &= {\rm var}({\bf z}|{\bf x}_2) \\ &= {\rm var}({\bf z}) \end{align*}
The cross-covariance terms vanish because ${\bf A}{\bf x}_2$ is constant given ${\bf x}_2$, and the last equality holds because ${\bf z}$ is independent of ${\bf x}_2$.
Now we're almost done:
\begin{align*} {\rm var}({\bf x}_1|{\bf x}_2) = {\rm var}( {\bf z} ) &= {\rm var}( {\bf x}_1 + {\bf A} {\bf x}_2 ) \\ &= {\rm var}( {\bf x}_1 ) + {\bf A} {\rm var}( {\bf x}_2 ) {\bf A}' + {\rm cov}({\bf x}_1,{\bf x}_2) {\bf A}' + {\bf A}\, {\rm cov}({\bf x}_2,{\bf x}_1) \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22}\Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} - 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\ &= \Sigma_{11} -\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21} \end{align*}
which proves the second part.
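If you want to check the result numerically, here is a minimal sketch (assuming NumPy; the covariance matrix and the observed ${\bf x}_2$ are made up) that evaluates the derived mean and covariance and verifies that ${\rm cov}({\bf z}, {\bf x}_2) = 0$ for ${\bf A} = -\Sigma_{12}\Sigma_{22}^{-1}$:

```python
# Conditional mean/covariance of x1 | x2, plus a check that z = x1 + A x2
# with A = -Sigma_12 Sigma_22^{-1} is uncorrelated with x2.
import numpy as np

rng = np.random.default_rng(3)
h, k = 2, 3
M = rng.normal(size=(h + k, h + k))
Sigma = M @ M.T + (h + k) * np.eye(h + k)
mu = rng.normal(size=h + k)

S11, S12 = Sigma[:h, :h], Sigma[:h, h:]
S21, S22 = Sigma[h:, :h], Sigma[h:, h:]
mu1, mu2 = mu[:h], mu[h:]

x2 = rng.normal(size=k)                                   # an arbitrary "observed" x2
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)    # mu_1 + S12 S22^{-1} (x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)          # S11 - S12 S22^{-1} S21

A = -S12 @ np.linalg.inv(S22)
cov_z_x2 = S12 + A @ S22                                  # cov(z, x2) from the derivation
print(np.allclose(cov_z_x2, 0))                           # True
```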
Note: For those not very familiar with the matrix algebra used here, the Matrix Cookbook is an excellent resource.
Edit: One property used here that is not in the Matrix Cookbook (good catch @FlyingPig) is property 6 on the Wikipedia page about covariance matrices: for two random vectors $\bf x, y$, $${\rm var}({\bf x}+{\bf y}) = {\rm var}({\bf x})+{\rm var}({\bf y}) + {\rm cov}({\bf x},{\bf y}) + {\rm cov}({\bf y},{\bf x})$$ For scalars, of course, ${\rm cov}(X,Y)={\rm cov}(Y,X)$, but for vectors the two cross-covariance matrices are different: they are transposes of each other, ${\rm cov}({\bf y},{\bf x}) = {\rm cov}({\bf x},{\bf y})^T$.
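And a quick sanity check of that identity (assuming NumPy, with made-up correlated data); it holds exactly for sample covariances as well, since the sample covariance is bilinear:

```python
# Verify var(x + y) = var(x) + var(y) + cov(x, y) + cov(y, x) on sample covariances.
import numpy as np

rng = np.random.default_rng(4)
n, d = 1000, 3
L = rng.normal(size=(2 * d, 2 * d))
xy = rng.normal(size=(n, 2 * d)) @ L.T       # correlated samples of the stacked (x, y)
x, y = xy[:, :d], xy[:, d:]

full = np.cov(xy, rowvar=False)              # covariance of the stacked vector
Vx, Vy = full[:d, :d], full[d:, d:]
Cxy, Cyx = full[:d, d:], full[d:, :d]

lhs = np.cov(x + y, rowvar=False)
print(np.allclose(lhs, Vx + Vy + Cxy + Cyx))  # True
```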