Start with the distributions on our random vectors:
$\mathbf x_i \mid \mu \sim N(\mu, \mathbf \Sigma)$
$\mu \sim N(\mu_0, \mathbf \Sigma_0)$
By Bayes's rule the posterior distribution looks like:
$p(\mu| \{\mathbf x_i\}) \propto p(\mu) \prod_{i=1}^N p(\mathbf x_i | \mu)$
So:
$\ln p(\mu| \{\mathbf x_i\}) = -\frac{1}{2}\sum_{i=1}^N(\mathbf x_i - \mu)'\mathbf \Sigma^{-1}(\mathbf x_i - \mu) -\frac{1}{2}(\mu - \mu_0)'\mathbf \Sigma_0^{-1}(\mu - \mu_0) + const$
$ = -\frac{1}{2} N \mu' \mathbf \Sigma^{-1} \mu + \sum_{i=1}^N \mu' \mathbf \Sigma^{-1} \mathbf x_i -\frac{1}{2} \mu' \mathbf \Sigma_0^{-1} \mu + \mu' \mathbf \Sigma_0^{-1} \mu_0 + const$
$ = -\frac{1}{2} \mu' (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1}) \mu + \mu' (\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i) + const$
$= -\frac{1}{2}(\mu - (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i))' (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1}) (\mu - (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i)) + const$
This is the log density of a Gaussian:
$\mu| \{\mathbf x_i\} \sim N((N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i), (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1})$
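As a quick numerical sketch in R (all values below are invented for illustration, not taken from the question), the posterior parameters can be computed directly:
# Illustrative check of the posterior formulas; Sigma, Sigma0, mu0 are arbitrary.
Sigma  <- matrix(c(2, 0.5, 0.5, 1), 2, 2)      # likelihood covariance
Sigma0 <- diag(2)                              # prior covariance
mu0    <- c(0, 0)                              # prior mean
N      <- 50
set.seed(1)
X <- matrix(rnorm(2 * N, mean = 1), ncol = 2)  # N observations as rows
post.cov  <- solve(N * solve(Sigma) + solve(Sigma0))
post.mean <- post.cov %*% (solve(Sigma0) %*% mu0 + solve(Sigma) %*% colSums(X))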
Using the Woodbury identity on our expression for the covariance matrix:
$(N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1} = \mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mathbf \Sigma_0$
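Continuing the numerical sketch above, the identity checks out:
lhs <- solve(N * solve(Sigma) + solve(Sigma0))
rhs <- Sigma %*% solve(Sigma / N + Sigma0) %*% (Sigma0 / N)
all.equal(lhs, rhs)  # TRUE, up to floating-point tolerance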
This gives the covariance matrix in the form the OP wanted. Substituting this expression into the mean, together with its symmetric counterpart $(N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1} = \frac{1}{N} \mathbf \Sigma_0(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \mathbf \Sigma$ (the inverse of a symmetric matrix is symmetric), we have:
$\mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mathbf \Sigma_0 \mathbf \Sigma_0^{-1} \mu_0 + \frac{1}{N} \mathbf \Sigma_0(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \mathbf \Sigma \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i$
$= \mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mu_0 + \mathbf \Sigma_0(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \sum_{i=1}^N \frac{1}{N} \mathbf x_i$
This is the form the OP required for the mean.
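And again numerically (continuing the sketch), the rearranged mean agrees with the original expression:
m1 <- post.mean  # (N*Sigma^-1 + Sigma0^-1)^-1 (Sigma0^-1 mu0 + Sigma^-1 sum(x_i))
m2 <- Sigma %*% solve(Sigma / N + Sigma0) %*% (mu0 / N) +
      Sigma0 %*% solve(Sigma / N + Sigma0) %*% colMeans(X)
all.equal(m1, m2)  # TRUE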
If I have understood your question correctly, you want the CDF of the bivariate normal distribution. That is, for the standardized case:
$$
\Phi(\pmb{b},\rho) = \frac{1}{2\pi\sqrt{1-\rho^{2}}}\int_{-\infty}^{b_{1}}\int_{-\infty}^{b_{2}}\exp\left[-\frac{x^{2}-2\rho xy+y^{2}}{2(1-\rho^{2})}\right]\mathrm{d}y\,\mathrm{d}x
$$
This has no closed-form solution and must be integrated numerically. With modern software, this is quite trivial.
Here is an example in R with perfectly correlated normal distributions (i.e., $\rho = 1$), $\pmb{\mu}=(100, 40)^\intercal$, and covariance matrix $\Sigma = \begin{bmatrix} 225 & 75 \\ 75 & 25 \end{bmatrix}$. We calculate the probability of both variables being more than $2$ SD above their means:
library(mvtnorm)
corr.mat <- matrix(c(1, 1, 1, 1), 2, 2, byrow = TRUE) # correlations
sd.mat <- matrix(c(15, 0, 0, 5), 2, 2, byrow = TRUE) # standard deviations
cov.mat <- sd.mat %*% corr.mat %*% sd.mat # covariance matrix
mu <- c(100, 40) # means
pmvnorm(lower = mu + 2*diag(sd.mat), upper = Inf, mean = mu, sigma = cov.mat)
[1] 0.02275013
attr(,"error")
[1] 2e-16
attr(,"msg")
[1] "Normal Completion"
As you rightly said: when the variables are perfectly correlated, the probability is about 2.3%.
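Since $\rho = 1$ collapses the two events into one, we can sanity-check this against the univariate tail beyond 2 SD:
pnorm(2, lower.tail = FALSE)
[1] 0.02275013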
What about a correlation of -0.21?
corr.mat <- matrix(c(1, -0.21, -0.21, 1), 2, 2, byrow = TRUE) # correlations
sd.mat <- matrix(c(15, 0, 0, 5), 2, 2, byrow = TRUE) # standard deviations
cov.mat <- sd.mat %*% corr.mat %*% sd.mat # covariance matrix
mu <- c(100, 40) # means
pmvnorm(lower = mu + 2*diag(sd.mat), upper = Inf, mean = mu, sigma = cov.mat)
[1] 0.0001228342
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"
The probability is much lower, namely 0.0001228342.
We can verify our calculations by simulation. For the second example ($\rho = -0.21$):
set.seed(21)
dat <- rmvnorm(1e7, mean = c(100, 40), sigma = cov.mat)
mean(dat[, 1] > (100 + 2*15) & dat[, 2] > (40 + 2*5)) # proportion of draws in the region
[1] 0.0001261
This is very close to the result from numerical integration.
These calculations can easily be extended to multivariate normal distributions.
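For instance, a trivariate case (with made-up correlations) works the same way:
corr.mat <- matrix(c(   1,  0.3, -0.1,
                      0.3,    1,  0.2,
                     -0.1,  0.2,    1), 3, 3, byrow = TRUE) # correlations
sd.mat <- diag(c(15, 5, 10))                                # standard deviations
cov.mat <- sd.mat %*% corr.mat %*% sd.mat                   # covariance matrix
mu <- c(100, 40, 60)                                        # means
pmvnorm(lower = mu + 2*diag(sd.mat), upper = Inf, mean = mu, sigma = cov.mat)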
Wikipedia lists a number of anomaly detection examples which do not explicitly assume a normal distribution (although some arguments can be made about implicit assumptions). So the answer to your main question is yes, anomaly detection can work without the assumption of the underlying data being normally distributed.
You give a particular example of an anomaly detection algorithm, where we compute the joint $$ p(x)=p(x_1,\ldots,x_n)=p(x_1)\cdots p(x_n), $$ but we can only factor $p(x)$ in this way by assuming the $x_i$'s are independent. This is the only requirement for this relation to hold. That means the $x_i$'s can have different distributions. (Consider an example where we independently toss an unfair coin with $\mathbb{P}(\text{head})=0.25$ and roll a fair die. The result of the unfair coin toss $w$ has a Bernoulli distribution, but the result of the fair die roll $z$ has a discrete uniform distribution. You can show that $p(w,z)=p(w)p(z)$, even though the distributions are different!)
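A quick simulation of that coin-and-die example (checking one cell of the joint distribution) illustrates the factorization:
set.seed(42)
n <- 1e6
w <- rbinom(n, 1, 0.25)             # unfair coin: 1 = head, P(head) = 0.25
z <- sample(1:6, n, replace = TRUE) # fair die
mean(w == 1 & z == 3)               # empirical p(w, z) for one cell
mean(w == 1) * mean(z == 3)         # product of empirical marginals; approximately equal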
For now, let's assume feature $x_i$ is normally distributed with mean $\mu_i$ and variance $\sigma_i^2$. Why do people usually assume a normal distribution for every feature? There are many possible reasons, but they often boil down to two concepts: many real-valued features are approximately normal in practice (a central-limit-theorem-style argument), and the Gaussian is mathematically convenient, since $\mu_i$ and $\sigma_i^2$ have simple closed-form estimates.
So essentially, we assume a Gaussian distribution for convenience. It's hard to predict how well the algorithm will perform if the normality assumption is incorrect; however, this assumption is prevalent throughout the literature and in practice, so it's a reasonable first try.
But nothing is stopping us from choosing different distributions for $x_i$. The math won't break down unless the independence assumption is violated, and even in practice, some mild dependence between features can be tolerated.
Ultimately, the performance of the algorithm will depend on the quality of the parameter estimates $\theta$ for our density of choice $p(x_i;\theta)$ and the suitability of that density for the true feature distribution.
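As a minimal sketch of this flexibility (hypothetical features, maximum-likelihood fits, and a hand-picked threshold $\epsilon$, none of which come from the question):
# Product-of-marginals anomaly score with mixed marginals:
# feature 1 is modeled as Normal, feature 2 as Exponential.
set.seed(7)
x1 <- rnorm(1000, mean = 5, sd = 2)
x2 <- rexp(1000, rate = 0.5)
mu.hat   <- mean(x1); sd.hat <- sd(x1)  # Gaussian fit
rate.hat <- 1 / mean(x2)                # exponential MLE
p <- function(a, b) dnorm(a, mu.hat, sd.hat) * dexp(b, rate.hat)
eps <- 1e-4                             # hand-picked threshold
p(5, 2)   < eps  # FALSE: a typical point is not flagged
p(15, 20) < eps  # TRUE: an extreme point is flagged as an anomaly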