Solved – Can anomaly detection work without assuming a normal distribution of the underlying data

anomaly detection, distributions, machine learning, normal distribution

I've been reading up on Gaussian-based anomaly detection, and the classic assumption seems to be that the data follows a normal distribution.

However, is this a hard requirement? What if one of my features followed a Beta distribution, another a normal, yet another a Poisson, a Gamma, etc.? Would the anomaly detection algorithm still work?

It seems the principle that $$p(x) = p(x_1;\mu_1,\sigma_1^2)\times \dots \times p(x_n;\mu_n,\sigma_n^2) < \varepsilon$$ should hold for any distributions of $x_1$ to $x_n$; they need not be normal with parameters $\mu_i, \sigma_i^2$.

I wasn't able to find any reference saying that we could use the above formula for features $x_1$ to $x_n$, each of which may follow a different distribution. Most sources, including Andrew Ng's Machine Learning class on Coursera, say that you should play with transformations to make the data (approximately) normal if it isn't normal to begin with.
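
For context, the transformations suggested there are usually simple monotone ones such as a log or Box-Cox transform, applied per feature before fitting the Gaussian. A minimal sketch of that idea (the data here is made up, and `scipy` is just one way to do the check):

```python
import numpy as np
from scipy import stats

# Made-up, right-skewed feature that is clearly not Gaussian.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Option 1: simple log transform (works when x > 0).
x_log = np.log(x)

# Option 2: let Box-Cox pick the exponent that makes the data most Gaussian-like.
x_boxcox, lam = stats.boxcox(x)

# Quick normality check before and after (a larger p-value = "looks more normal").
print("raw     :", stats.normaltest(x).pvalue)
print("log     :", stats.normaltest(x_log).pvalue)
print("box-cox :", stats.normaltest(x_boxcox).pvalue)
```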

Questions:

  1. Why is the assumption of normality crucial to the algorithm?
  2. If we use a different distribution for each of the features, does the algorithm still work? How well or badly does it perform?

Best Answer

Wikipedia lists a number of anomaly detection examples which do not explicitly assume a normal distribution (although some arguments can be made about implicit assumptions). So the answer to your main question is yes, anomaly detection can work without the assumption of the underlying data being normally distributed.

You give a particular example of an anomaly detection algorithm, where we compute the joint density $$ p(x)=p(x_1,\ldots,x_n)=p(x_1)\cdots p(x_n), $$ but we can only factor $p(x)$ in this way by assuming the $x_i$'s are independent. Independence is the only requirement for this relation to hold, which means the $x_i$'s can have different distributions. (Consider an example where we independently toss an unfair coin with $\mathbb{P}(\text{head})=0.25$ and roll a fair die. The result of the unfair coin toss $w$ has a Bernoulli distribution, while the result of the fair die roll $z$ has a discrete uniform distribution. You can show that $p(w,z)=p(w)p(z)$, even though the two distributions are different!)
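
As a quick sanity check, here is a small simulation of exactly that coin-plus-die setup (with an arbitrary sample size): the empirical joint frequencies match the product of the marginal frequencies even though the two marginals are different distributions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # arbitrary sample size, just large enough for stable frequencies

# w ~ Bernoulli(0.25) (unfair coin, True = heads), z ~ discrete uniform on {1,...,6}.
w = rng.random(n) < 0.25
z = rng.integers(1, 7, size=n)

# Compare empirical P(w=1, z=k) with P(w=1) * P(z=k) for every face k.
p_w1 = w.mean()
for k in range(1, 7):
    joint = np.mean(w & (z == k))
    product = p_w1 * np.mean(z == k)
    print(f"k={k}: joint={joint:.4f}  product={product:.4f}")
```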

For now, let's assume each feature $x_i$ is normally distributed with mean $\mu_i$ and variance $\sigma_i^2$. Why do people usually assume a normal distribution for every feature? There are many possible reasons, but they often boil down to the following two concepts:

  1. We can easily estimate $\mu_i$ and $\sigma_i^2$ via maximum likelihood: $$ \begin{split} \mu_i &= \frac{1}{N}\sum_{k=1}^N x^{(k)}_i\\ \sigma_i^2 &= \frac{1}{N}\sum_{k=1}^N(x^{(k)}_i-\mu_i)^2 \end{split} $$ and thus we have everything we need to know to compute $p(x)=\prod_i p(x_i;\mu_i,\sigma_i^2)$.
  2. The normality assumption is often a good approximation of reality.

So essentially, we assume a Gaussian distribution for convenience. It's hard to predict how well the algorithm will perform if the normality assumption is incorrect; however, this assumption is prevalent throughout the literature and in practice, so it's a reasonable first try.
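
To make the Gaussian version concrete, here is a minimal sketch of the whole procedure; it is just one way to write it down, not a reference implementation. The training matrix, query point, and threshold (`X_train`, `x_new`, `eps`) are hypothetical, and the densities are evaluated on the log scale to avoid numerical underflow.

```python
import numpy as np

def fit_gaussian_params(X):
    """Per-feature maximum-likelihood estimates: mean and (biased) variance."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # divides by N, matching the MLE formulas above
    return mu, sigma2

def log_p(x, mu, sigma2):
    """log p(x) = sum_i log N(x_i; mu_i, sigma_i^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Hypothetical "normal" training data: 1000 points, 2 features.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))

mu, sigma2 = fit_gaussian_params(X_train)

eps = -10.0                        # hypothetical log-scale threshold, e.g. tuned on a validation set
x_new = np.array([4.0, -3.0])      # a point far from the training data
print("anomaly" if log_p(x_new, mu, sigma2) < eps else "normal")
```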

But nothing is stopping us from choosing different distributions for $x_i$. The math won't break down unless the independence assumption is violated, and even in practice, some mild dependence between features can be tolerated.
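
For instance (a sketch with hypothetical feature choices, not a recommendation for any particular data set), one could fit a Gamma density to a skewed positive feature and a Gaussian to another, then threshold the product of the densities exactly as before:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical training data: feature 0 is skewed and positive, feature 1 is roughly Gaussian.
x0 = rng.gamma(shape=2.0, scale=3.0, size=1000)
x1 = rng.normal(loc=0.0, scale=1.0, size=1000)

# Fit a different parametric family to each feature.
a, loc0, scale0 = stats.gamma.fit(x0, floc=0)   # Gamma for the skewed feature
mu1, sd1 = stats.norm.fit(x1)                   # Gaussian for the other

def log_p(x):
    """log p(x) = log p(x_0) + log p(x_1), with a different density per feature."""
    return (stats.gamma.logpdf(x[0], a, loc=loc0, scale=scale0)
            + stats.norm.logpdf(x[1], loc=mu1, scale=sd1))

eps = -12.0                                     # hypothetical log-scale threshold
print(log_p(np.array([6.0, 0.1])))              # typical point  -> above eps
print(log_p(np.array([60.0, 0.1])))             # extreme first feature -> below eps
```

Only the estimation step changes (fitting whichever parametric family you picked for each feature); the thresholding logic stays the same.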

Ultimately, the performance of the algorithm will depend on the quality of the parameter estimates $\theta$ for our density of choice $p(x_i;\theta)$ and the suitability of that density for the true feature distribution.