Solved – Anomaly detection: multivariate Gaussian distribution

clustering, distributions, multivariate analysis, normal distribution, outliers

I am trying to do anomaly detection on a heterogeneous dataset (there are unknown groups present in the data). I want to try a multivariate Gaussian distribution based approach, but I ran into the following question:

Should I use a single multivariate Gaussian distribution for the entire dataset, or should I cluster the dataset first and fit a separate multivariate Gaussian distribution to each cluster? My intuition tells me to do the latter, but I am hesitant to use k-means clustering (my dataset has millions of records, but few features, fewer than 100).
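For concreteness, the first option would look something like the sketch below (the random data and the 1% cutoff are just placeholders for my real data and threshold):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))  # placeholder for the real feature matrix

# Option 1: fit a single multivariate Gaussian to the whole dataset.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov).pdf(X)

# Points with very low density under the fitted Gaussian are anomaly candidates.
epsilon = np.quantile(density, 0.01)  # arbitrary 1% cutoff
anomalies = X[density < epsilon]
print(f"Flagged {len(anomalies)} of {len(X)} points")
```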

Would you kindly advise?

Best Answer

If you plan on assuming Gaussian distributions, use Gaussian mixture modeling (see the EM algorithm on Wikipedia) instead of k-means. Why optimize squared deviations when you could instead optimize the fit of your Gaussian distributions?
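A minimal sketch of this with scikit-learn's GaussianMixture; the toy two-group data, the choice of k=2, and the 1% cutoff are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy heterogeneous data: two Gaussian groups standing in for the unknown clusters.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(5_000, 10)),
    rng.normal(loc=5.0, scale=2.0, size=(5_000, 10)),
])

# One Gaussian component per suspected group; k=2 matches this toy data.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# score_samples returns the per-point log-likelihood under the mixture;
# low values mark poorly explained points, i.e. anomaly candidates.
log_likelihood = gmm.score_samples(X)
threshold = np.percentile(log_likelihood, 1)  # flag the bottom 1% (arbitrary cutoff)
anomalies = X[log_likelihood < threshold]
print(f"Flagged {len(anomalies)} of {len(X)} points as anomalies")
```

This does in one step what the question proposes in two: the EM fit clusters the data and estimates a multivariate Gaussian per cluster simultaneously.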

There is also an implementation of this in ELKI. It works well, as long as your data is nicely Gaussian. Once your data is not, the clustering may return rather arbitrary results, and the outlier scores will unfortunately be all over the place.

There is also a stick-breaking prior for Gaussian modeling (I haven't seen that in ELKI yet). I tried it with scikit-learn (DPGMM, the Dirichlet Process Gaussian Mixture Model), but it did not work well at all for me, even on artificial Gaussian data, the best possible dataset for this type of problem. Essentially, I tried the code from this SO question (by increasing the number of iterations, it would eventually converge to a reasonable solution). However, despite the data being generated from 5 well-separated Gaussian distributions, the clustering it produced merged several clusters. So the result was much worse than using a fixed k and the regular GMM clustering approach.
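For reference, scikit-learn has since replaced DPGMM with BayesianGaussianMixture, which exposes the stick-breaking (Dirichlet process) prior via weight_concentration_prior_type. Here is a minimal sketch of an experiment like the one described above; the cluster layout, component cap, and iteration budget are all assumptions:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(42)
# Five well-separated Gaussian clusters, mirroring the artificial data described above.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(500, 2)) for c in centers])

# Cap the number of components generously; the stick-breaking prior is supposed
# to switch off the components it does not need.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,   # more iterations help convergence, as noted above
    random_state=42,
)
labels = dpgmm.fit_predict(X)

# Count how many components the model actually used; merged clusters show up
# as fewer effective components than the true 5.
print("Effective components:", len(np.unique(labels)))
print("Mixture weights:", np.round(dpgmm.weights_, 3))
```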