Solved – Can anomaly detection work without assuming a normal distribution of the underlying data

anomaly detection, distributions, machine learning, normal distribution

I've been reading up on Gaussian-based anomaly detection, and the classic assumption seems to be that the data follows a normal distribution.

However, is this a hard requirement? What if one of my features followed a Beta distribution, another a normal, yet another a Poisson, a Gamma, etc.? Would the anomaly detection algorithm still work?

It seems the principle that $$p(x) = p(x_1;\mu_1,\sigma_1^2)\times \dots \times p(x_n;\mu_n,\sigma_n^2) < \varepsilon$$ should hold for any distributions of $x_1$ to $x_n$; they need not be normal with parameters $\mu_i, \sigma_i^2$.

I wasn't able to find any reference saying that we could use the above formula for features $x_1$ to $x_n$, each of which may follow a different distribution. Most sources, including Andrew Ng's Machine Learning class on Coursera, say that you should play with transformations to make the data (approximately) normal if it isn't normal to begin with.
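
For context, the transformations suggested there are usually simple monotone ones such as a log or Box-Cox transform, applied per feature before fitting the Gaussian. A minimal sketch of that idea (the data here is made up, and `scipy` is just one way to do the check):

```python
import numpy as np
from scipy import stats

# Made-up, right-skewed feature that is clearly not Gaussian.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Option 1: simple log transform (works when x > 0).
x_log = np.log(x)

# Option 2: let Box-Cox pick the exponent that makes the data most Gaussian-like.
x_boxcox, lam = stats.boxcox(x)

# Quick normality check before and after (a larger p-value = "looks more normal").
print("raw     :", stats.normaltest(x).pvalue)
print("log     :", stats.normaltest(x_log).pvalue)
print("box-cox :", stats.normaltest(x_boxcox).pvalue)
```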

Questions:

  1. Why is the assumption of normality crucial to the algorithm?
  2. If we use a different distribution for each of the features, does the algorithm still work? How well or badly does it perform?

Best Answer

Wikipedia lists a number of anomaly detection examples which do not explicitly assume a normal distribution (although some arguments can be made about implicit assumptions). So the answer to your main question is yes, anomaly detection can work without the assumption of the underlying data being normally distributed.

You give a particular example of an anomaly detection algorithm, where we compute the joint density $$ p(x)=p(x_1,\ldots,x_n)=p(x_1)\cdots p(x_n), $$ but we can only factor $p(x)$ in this way by assuming the $x_i$'s are independent. Independence is the only requirement for this relation to hold, which means the $x_i$'s can have different distributions. (Consider an example where we independently toss an unfair coin with $\mathbb{P}(\text{head})=0.25$ and roll a fair die. The result of the unfair coin toss $w$ has a Bernoulli distribution, while the result of the fair die roll $z$ has a discrete uniform distribution. You can show that $p(w,z)=p(w)p(z)$, even though the two distributions are different!)
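
As a quick sanity check, here is a small simulation of exactly that coin-plus-die setup (with an arbitrary sample size): the empirical joint frequencies match the product of the marginal frequencies even though the two marginals are different distributions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # arbitrary sample size, just large enough for stable frequencies

# w ~ Bernoulli(0.25) (unfair coin, True = heads), z ~ discrete uniform on {1,...,6}.
w = rng.random(n) < 0.25
z = rng.integers(1, 7, size=n)

# Compare empirical P(w=1, z=k) with P(w=1) * P(z=k) for every face k.
p_w1 = w.mean()
for k in range(1, 7):
    joint = np.mean(w & (z == k))
    product = p_w1 * np.mean(z == k)
    print(f"k={k}: joint={joint:.4f}  product={product:.4f}")
```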

For now, let's assume each feature $x_i$ is normally distributed with mean $\mu_i$ and variance $\sigma_i^2$. Why do people usually assume a normal distribution for every feature? There are many possible reasons, but they often boil down to the following two concepts:

  1. We can easily estimate $\mu_i$ and $\sigma_i^2$ via maximum likelihood: $$ \begin{split} \mu_i &= \frac{1}{N}\sum_{k=1}^N x^{(k)}_i\\ \sigma_i^2 &= \frac{1}{N}\sum_{k=1}^N(x^{(k)}_i-\mu_i)^2 \end{split} $$ and thus we have everything we need to know to compute $p(x)=\prod_i p(x_i;\mu_i,\sigma_i^2)$.
  2. The normality assumption is often a good approximation of reality.

So essentially, we assume a Gaussian distribution for convenience. It's hard to predict how well the algorithm will perform if the normality assumption is incorrect; however, this assumption is prevalent throughout the literature and in practice, so it's a reasonable first try.
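
To make the Gaussian version concrete, here is a minimal sketch of the whole procedure; it is just one way to write it down, not a reference implementation. The training matrix, query point, and threshold (`X_train`, `x_new`, `eps`) are hypothetical, and the densities are evaluated on the log scale to avoid numerical underflow.

```python
import numpy as np

def fit_gaussian_params(X):
    """Per-feature maximum-likelihood estimates: mean and (biased) variance."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # divides by N, matching the MLE formulas above
    return mu, sigma2

def log_p(x, mu, sigma2):
    """log p(x) = sum_i log N(x_i; mu_i, sigma_i^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Hypothetical "normal" training data: 1000 points, 2 features.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))

mu, sigma2 = fit_gaussian_params(X_train)

eps = -10.0                        # hypothetical log-scale threshold, e.g. tuned on a validation set
x_new = np.array([4.0, -3.0])      # a point far from the training data
print("anomaly" if log_p(x_new, mu, sigma2) < eps else "normal")
```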

But nothing is stopping us from choosing different distributions for $x_i$. The math won't break down unless the independence assumption is violated, and even in practice, some mild dependence between features can be tolerated.
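
For instance (a sketch with hypothetical feature choices, not a recommendation for any particular data set), one could fit a Gamma density to a skewed positive feature and a Gaussian to another, then threshold the product of the densities exactly as before:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical training data: feature 0 is skewed and positive, feature 1 is roughly Gaussian.
x0 = rng.gamma(shape=2.0, scale=3.0, size=1000)
x1 = rng.normal(loc=0.0, scale=1.0, size=1000)

# Fit a different parametric family to each feature.
a, loc0, scale0 = stats.gamma.fit(x0, floc=0)   # Gamma for the skewed feature
mu1, sd1 = stats.norm.fit(x1)                   # Gaussian for the other

def log_p(x):
    """log p(x) = log p(x_0) + log p(x_1), with a different density per feature."""
    return (stats.gamma.logpdf(x[0], a, loc=loc0, scale=scale0)
            + stats.norm.logpdf(x[1], loc=mu1, scale=sd1))

eps = -12.0                                     # hypothetical log-scale threshold
print(log_p(np.array([6.0, 0.1])))              # typical point  -> above eps
print(log_p(np.array([60.0, 0.1])))             # extreme first feature -> below eps
```

Only the estimation step changes (fitting whichever parametric family you picked for each feature); the thresholding logic stays the same.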

Ultimately, the performance of the algorithm will depend on the quality of the parameter estimates $\theta$ for our density of choice $p(x_i;\theta)$ and the suitability of that density for the true feature distribution.