You can prove it by explicitly calculating the conditional density by brute force, as in Procrastinator's link (+1) in the comments. But there's also a theorem that says all conditional distributions of a multivariate normal distribution are normal. Therefore, all that's left is to calculate the mean vector and covariance matrix. I remember we derived this in a time series class in college by cleverly defining a third variable and using its properties to derive the result more simply than the brute-force solution in the link (as long as you're comfortable with matrix algebra). I'm going from memory, but it was something like this:
Let ${\bf x}_{1}$ be the first partition and ${\bf x}_2$ the second. Now define ${\bf z} = {\bf x}_1 + {\bf A} {\bf x}_2 $ where ${\bf A} = -\Sigma_{12} \Sigma^{-1}_{22}$. Now we can write
\begin{align*} {\rm cov}({\bf z}, {\bf x}_2) &= {\rm cov}( {\bf x}_{1}, {\bf x}_2 ) +
{\rm cov}({\bf A}{\bf x}_2, {\bf x}_2) \\
&= \Sigma_{12} + {\bf A} {\rm var}({\bf x}_2) \\
&= \Sigma_{12} - \Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22} \\
&= 0
\end{align*}
Therefore ${\bf z}$ and ${\bf x}_2$ are uncorrelated and, since they are jointly normal, they are independent. Now, clearly $E({\bf z}) = {\boldsymbol \mu}_1 + {\bf A} {\boldsymbol \mu}_2$, therefore it follows that
\begin{align*}
E({\bf x}_1 | {\bf x}_2) &= E( {\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\
& = E({\bf z}|{\bf x}_2) - E({\bf A}{\bf x}_2|{\bf x}_2) \\
& = E({\bf z}) - {\bf A}{\bf x}_2 \\
& = {\boldsymbol \mu}_1 + {\bf A} ({\boldsymbol \mu}_2 - {\bf x}_2) \\
& = {\boldsymbol \mu}_1 + \Sigma_{12} \Sigma^{-1}_{22} ({\bf x}_2- {\boldsymbol \mu}_2)
\end{align*}
which proves the first part. For the covariance matrix, note that
\begin{align*}
{\rm var}({\bf x}_1|{\bf x}_2) &= {\rm var}({\bf z} - {\bf A} {\bf x}_2 | {\bf x}_2) \\
&= {\rm var}({\bf z}|{\bf x}_2) + {\bf A}\, {\rm var}({\bf x}_2 | {\bf x}_2)\, {\bf A}' - {\rm cov}({\bf z}, {\bf x}_2 | {\bf x}_2)\, {\bf A}' - {\bf A}\, {\rm cov}({\bf x}_2, {\bf z} | {\bf x}_2) \\
&= {\rm var}({\bf z}|{\bf x}_2) \\
&= {\rm var}({\bf z})
\end{align*}
Now we're almost done:
\begin{align*}
{\rm var}({\bf x}_1|{\bf x}_2) = {\rm var}( {\bf z} ) &= {\rm var}( {\bf x}_1 + {\bf A} {\bf x}_2 ) \\
&= {\rm var}( {\bf x}_1 ) + {\bf A} {\rm var}( {\bf x}_2 ) {\bf A}'
+ {\bf A} {\rm cov}({\bf x}_2,{\bf x}_1) + {\rm cov}({\bf x}_1,{\bf x}_2) {\bf A}' \\
&= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22} \Sigma_{22}\Sigma^{-1}_{22}\Sigma_{21}
- 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\
&= \Sigma_{11} +\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21}
- 2 \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \\
&= \Sigma_{11} -\Sigma_{12} \Sigma^{-1}_{22}\Sigma_{21}
\end{align*}
which proves the second part.
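A quick numerical sanity check of both results, as a minimal sketch rather than a proof. The function name `conditional_gaussian`, the random test covariance, and the cut point `h` are assumptions made up for illustration. The check confirms that ${\rm cov}({\bf z}, {\bf x}_2)$ vanishes and compares the conditional covariance against the standard block-inverse identity (the inverse of the top-left block of $\Sigma^{-1}$).

```python
import numpy as np

def conditional_gaussian(mu, Sigma, h, x2):
    """Mean and covariance of x1 | x2 when the first h coordinates form x1."""
    mu1, mu2 = mu[:h], mu[h:]
    S11 = Sigma[:h, :h]
    S21, S22 = Sigma[h:, :h], Sigma[h:, h:]
    # A = -Sigma_12 Sigma_22^{-1}, computed with a solve instead of an explicit inverse
    A = -np.linalg.solve(S22, S21).T
    cond_mean = mu1 - A @ (x2 - mu2)   # mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
    cond_cov = S11 + A @ S21           # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return cond_mean, cond_cov, A

# Hypothetical test problem: a random symmetric positive definite covariance
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T + 5 * np.eye(5)
mu = np.arange(5.0)
h = 2
x2 = np.array([1.0, -1.0, 0.5])

cond_mean, cond_cov, A = conditional_gaussian(mu, Sigma, h, x2)

# cov(z, x2) = Sigma_12 + A Sigma_22, which the derivation says is zero
cov_z_x2 = Sigma[:h, h:] + A @ Sigma[h:, h:]

# Independent cross-check via the precision matrix
precision = np.linalg.inv(Sigma)
schur_check = np.linalg.inv(precision[:h, :h])

print(np.allclose(cov_z_x2, 0.0, atol=1e-10), np.allclose(cond_cov, schur_check))
```

Using `np.linalg.solve` avoids forming $\Sigma_{22}^{-1}$ explicitly, which tends to be the numerically safer habit.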
Note: For those not very familiar with the matrix algebra used here, the Matrix Cookbook is an excellent resource.
Edit: One property used here that is not in the Matrix Cookbook (good catch @FlyingPig) is property 6 on the Wikipedia page about covariance matrices: for two random vectors $\bf x, y$, $${\rm var}({\bf x}+{\bf y}) = {\rm var}({\bf x})+{\rm var}({\bf y}) + {\rm cov}({\bf x},{\bf y}) + {\rm cov}({\bf y},{\bf x})$$ For scalars, of course, ${\rm cov}(X,Y)={\rm cov}(Y,X)$, but for vectors the two matrices are transposes of each other, ${\rm cov}({\bf y},{\bf x}) = {\rm cov}({\bf x},{\bf y})'$, so they generally differ.
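The identity can be checked numerically from a joint covariance matrix. This is only an illustrative sketch: the stacked vector, its random covariance `joint`, and the name `summer` are assumptions, not part of the original derivation. It uses the fact that ${\bf x}+{\bf y}$ is a linear map $[{\bf I}\ {\bf I}]$ applied to the stacked vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

# Hypothetical joint covariance of the stacked vector (x, y)
B = rng.standard_normal((2 * n, 2 * n))
joint = B @ B.T

var_x, cov_xy = joint[:n, :n], joint[:n, n:]
cov_yx, var_y = joint[n:, :n], joint[n:, n:]

# x + y = [I I] (x; y), so var(x + y) = [I I] joint [I I]'
summer = np.hstack([np.eye(n), np.eye(n)])
var_sum = summer @ joint @ summer.T

# Right-hand side of the identity, keeping both cross terms
identity_rhs = var_x + var_y + cov_xy + cov_yx

print(np.allclose(var_sum, identity_rhs))
print(np.allclose(cov_yx, cov_xy.T))
```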
The problem is in the matrix differentiation. As the covariance matrix is symmetric, we have
$$
\frac{\partial l}{\partial \Sigma} = -\Sigma^{-1} + \frac{\operatorname{diag}(\Sigma^{-1})}{2} + \Sigma^{-1}(x-\mu)(x-\mu)'\Sigma^{-1} - \frac{\operatorname{diag}\left(\Sigma^{-1}(x-\mu)(x-\mu)'\Sigma^{-1}\right)}{2}
$$
where $l$ is the log-likelihood function.
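One way to convince yourself of this formula is a finite-difference check. The sketch below makes two assumptions that are not stated above: $l$ is the log-likelihood of a single observation, $l = -\tfrac12 \log|\Sigma| - \tfrac12 (x-\mu)'\Sigma^{-1}(x-\mu)$ up to an additive constant, and $\operatorname{diag}(\cdot)$ keeps the diagonal of a matrix and zeros everything else. All function names are hypothetical. Symmetry is respected by perturbing $\Sigma_{ij}$ and $\Sigma_{ji}$ together.

```python
import numpy as np

def loglik(Sigma, x, mu):
    """Single-observation Gaussian log-likelihood, additive constant dropped."""
    r = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * r @ np.linalg.solve(Sigma, r)

def symmetric_gradient(Sigma, x, mu):
    """The closed-form derivative stated above, for symmetric Sigma."""
    P = np.linalg.inv(Sigma)
    r = (x - mu)[:, None]
    G = P @ r @ r.T @ P
    return -P + np.diag(np.diag(P)) / 2 + G - np.diag(np.diag(G)) / 2

def numeric_gradient(Sigma, x, mu, eps=1e-6):
    """Central differences, moving mirrored entries (i, j) and (j, i) together."""
    n = Sigma.shape[0]
    grad = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            E = np.zeros((n, n))
            E[i, j] = E[j, i] = 1.0
            up = loglik(Sigma + eps * E, x, mu)
            down = loglik(Sigma - eps * E, x, mu)
            grad[i, j] = grad[j, i] = (up - down) / (2 * eps)
    return grad

# Hypothetical test problem
rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
Sigma = B @ B.T + 3 * np.eye(3)
x = rng.standard_normal(3)
mu = np.zeros(3)

analytic = symmetric_gradient(Sigma, x, mu)
numeric = numeric_gradient(Sigma, x, mu)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the off-diagonal entries were instead perturbed one at a time, ignoring symmetry, the finite differences would match the unconstrained gradient $-\tfrac12\Sigma^{-1} + \tfrac12\Sigma^{-1}(x-\mu)(x-\mu)'\Sigma^{-1}$ rather than the expression above.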
Best Answer
In this case the vectors ${\boldsymbol Y}$ and ${\boldsymbol \mu}$ are really block vectors. In the case of an $n$-dimensional ${\boldsymbol Y}$ vector we could expand it as follows:
$$\boldsymbol Y= \begin{bmatrix} \color{blue}{Y_1} \\ \color{red}{Y_2} \end{bmatrix}=\begin{bmatrix}\color{blue}{Y_{11}\\Y_{12}\\\vdots\\ Y_{1h}}\\\color{red}{Y_{21}\\Y_{22}\\\vdots\\ Y_{2k}}\end{bmatrix}\tag{$n \times 1$}$$
showing the partition of the $n$ coordinates into two groups of size $h$ and $k$, respectively, such that $n = h + k$. A parallel illustration would immediately follow for the $\boldsymbol \mu$ vector of population means.
The block matrix of covariances would hence follow as:
$$\begin{bmatrix} \Sigma_{\color{blue}{11}} & \Sigma_{\color{blue}{1}\color{red}{2}}\\ \Sigma_{\color{red}{2}\color{blue}{1}} & \Sigma_{\color{red}{22}} \end{bmatrix} \tag {$n \times n$}$$
where
$$\small\Sigma_{\color{blue}{11}}=\begin{bmatrix} \sigma^2({\color{blue}{Y_{11}}}) & \text{cov}(\color{blue}{Y_{11},Y_{12}}) & \dots & \text{cov}(\color{blue}{Y_{11},Y_{1h}}) \\ \text{cov}(\color{blue}{Y_{12},Y_{11}}) & \sigma^2({\color{blue}{Y_{12}}}) & \dots & \text{cov}(\color{blue}{Y_{12},Y_{1h}}) \\ \vdots & \vdots & & \vdots \\ \text{cov}(\color{blue}{Y_{1h},Y_{11}}) & \text{cov}(\color{blue}{Y_{1h},Y_{12}}) &\dots& \sigma^2({\color{blue}{Y_{1h}}}) \end{bmatrix} \tag{$h \times h$}$$
with
$$\small \Sigma_{\color{blue}{1}\color{red}{2}}= \begin{bmatrix} \text{cov}({\color{blue}{Y_{11}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{11}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{11}},\color{red}{Y_{2k}}) \\ \text{cov}({\color{blue}{Y_{12}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{12}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{12}},\color{red}{Y_{2k}}) \\ \vdots & \vdots & & \vdots \\ \text{cov}({\color{blue}{Y_{1h}}},\color{red}{Y_{21}}) & \text{cov}(\color{blue}{Y_{1h}},\color{red}{Y_{22}}) & \dots & \text{cov}(\color{blue}{Y_{1h}},\color{red}{Y_{2k}}) \end{bmatrix}\tag{$h \times k$} $$
its transpose...
$$\small \Sigma_{\color{red}{2}\color{blue}{1}}= \begin{bmatrix} \text{cov}({\color{red}{Y_{21}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{21}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{21}},\color{blue}{Y_{1h}}) \\\text{cov}({\color{red}{Y_{22}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{22}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{22}},\color{blue}{Y_{1h}}) \\ \vdots & \vdots & & \vdots \\ \text{cov}({\color{red}{Y_{2k}}},\color{blue}{Y_{11}}) & \text{cov}(\color{red}{Y_{2k}},\color{blue}{Y_{12}}) & \dots & \text{cov}(\color{red}{Y_{2k}},\color{blue}{Y_{1h}}) \end{bmatrix}\tag{$k \times h$} $$
and
$$\small \Sigma_{\color{red}{22}}=\begin{bmatrix} \sigma^2({\color{red}{Y_{21}}}) & \text{cov}(\color{red}{Y_{21},Y_{22}}) & \dots & \text{cov}(\color{red}{Y_{21},Y_{2k}}) \\ \text{cov}(\color{red}{Y_{22},Y_{21}}) & \sigma^2({\color{red}{Y_{22}}}) & \dots & \text{cov}(\color{red}{Y_{22},Y_{2k}}) \\ \vdots & \vdots & & \vdots \\ \text{cov}(\color{red}{Y_{2k},Y_{21}}) & \text{cov}(\color{red}{Y_{2k},Y_{22}}) &\dots& \sigma^2({\color{red}{Y_{2k}}}) \end{bmatrix} \tag{$k \times k$}$$
These partitions come into play in proving that the marginal distributions of a multivariate Gaussian are also Gaussian, as well as in the actual derivation of marginal and conditional pdfs.
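In code, the partition above amounts to plain array slicing. The following is a small sketch with made-up sizes ($n = 5$, $h = 2$, $k = 3$) and a hypothetical helper `partition`; it checks the block shapes, that $\Sigma_{21}$ is the transpose of $\Sigma_{12}$, and that the four blocks reassemble into the full matrix.

```python
import numpy as np

def partition(mu, Sigma, h):
    """Split mu and Sigma after the first h coordinates."""
    mu1, mu2 = mu[:h], mu[h:]
    S11, S12 = Sigma[:h, :h], Sigma[:h, h:]
    S21, S22 = Sigma[h:, :h], Sigma[h:, h:]
    return (mu1, mu2), (S11, S12, S21, S22)

# Hypothetical dimensions and covariance for illustration
n, h = 5, 2
k = n - h
rng = np.random.default_rng(3)
B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)      # symmetric positive definite
mu = np.linspace(0.0, 1.0, n)

(mu1, mu2), (S11, S12, S21, S22) = partition(mu, Sigma, h)

# Stitch the blocks back together to confirm nothing was lost
reassembled = np.block([[S11, S12], [S21, S22]])

print(S11.shape, S12.shape, S21.shape, S22.shape)
```

With this slicing, the marginal distribution of $Y_1$ has mean `mu1` and covariance `S11`, and that of $Y_2$ has mean `mu2` and covariance `S22`.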