Solved – Gaussian Mixture for detecting outliers

clustering, gaussian mixture distribution, machine learning, outliers

I'm trying to build a simple outlier detection program that can correctly, or almost correctly, identify values in a data set that are potential outliers because they don't follow the distribution of the rest of the values in that data set.

  1. I can't use supervised techniques like classification or regression because I don't have any historical, labeled data to train a model with, so I will be using unsupervised techniques such as clustering.

  2. I was going to use k-means clustering, but I read several articles saying that k-means handles outliers poorly, and some of them recommended trying a Gaussian mixture model instead.

I know Gaussian Mixture Models work by fitting several clusters that each represent a different distribution. I am using Apache Spark's implementation of the Gaussian Mixture Model, which gives me two columns relevant to my problem: a prediction column with the cluster each data point has been assigned to, and a probability column with the probability of each point belonging to each of the clusters. Working with this approach, how can I determine outliers?
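For concreteness, here is a minimal PySpark sketch of the setup I'm describing (the column names and toy data are illustrative assumptions on my part):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import GaussianMixture

spark = SparkSession.builder.appName("gmm-outliers").getOrCreate()

# Hypothetical input: a DataFrame with a single numeric column "value".
df = spark.createDataFrame([(1.0,), (1.2,), (0.9,), (25.0,)], ["value"])

# Spark's GMM expects a vector column of features.
assembler = VectorAssembler(inputCols=["value"], outputCol="features")
features = assembler.transform(df)

# k (the number of mixture components) is a hyperparameter.
gmm = GaussianMixture(k=2, featuresCol="features", seed=42)
model = gmm.fit(features)

# transform() adds the two columns mentioned above:
#   prediction  - index of the most likely cluster for each row
#   probability - vector of posterior membership probabilities per cluster
model.transform(features).select("value", "prediction", "probability") \
     .show(truncate=False)
```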

I thought of labeling as outliers the values assigned to the smallest cluster (the one with the fewest points), but this is not a good approach: even when there are no outliers, one cluster will always be smaller than the rest, since a GMM doesn't distribute values evenly across clusters. Is there an alternative approach I could use?

Best Answer

There is a smart way to do this that is implemented in JMP software. When fitting a GMM there, an option for an "outlier cluster" can be checked. Its description is below:

The outlier cluster option assumes a uniform distribution and is less sensitive to outliers than the standard Normal Mixtures method. This fits a cluster to catch outliers that do not fall into any of the normal clusters. The distribution of observations that fall in the outlier cluster is assumed to be uniform over the hypercube that encompasses the observations.

So what does this mean? Well, it's just one additional mixture component, a uniform distribution, with its own mixing weight that is estimated alongside the Gaussian components during EM. Naturally, the data points that don't fall near a legitimate Gaussian cluster end up with a higher posterior probability of belonging to the (sparse) uniform component.
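To make that concrete, here is a rough NumPy sketch of EM for k Gaussians plus one uniform "outlier" component over the bounding hypercube. This is my reading of the JMP description, not JMP's actual code, and all names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_with_outlier_cluster(X, k, n_iter=100, seed=0):
    """EM for k Gaussians plus one fixed uniform component over the
    bounding hypercube of the data. Returns a boolean outlier mask."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Uniform density over the hypercube that encompasses the observations.
    volume = np.prod(X.max(axis=0) - X.min(axis=0))
    unif_density = 1.0 / volume

    # Initialize Gaussian parameters and mixing weights (k Gaussians + 1 uniform).
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k + 1, 1.0 / (k + 1))

    for _ in range(n_iter):
        # E-step: responsibilities for the k Gaussians and the uniform component.
        dens = np.column_stack(
            [multivariate_normal.pdf(X, means[j], covs[j]) for j in range(k)]
            + [np.full(n, unif_density)]
        )
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update all mixing weights, but only the Gaussians' means
        # and covariances (the uniform density stays fixed).
        weights = resp.mean(axis=0)
        for j in range(k):
            r = resp[:, j]
            means[j] = r @ X / r.sum()
            diff = X - means[j]
            covs[j] = (r[:, None] * diff).T @ diff / r.sum() + 1e-6 * np.eye(d)

    # A point is flagged as an outlier when the uniform component (index k)
    # is its most probable assignment.
    return resp.argmax(axis=1) == k
```

Because the uniform density is constant everywhere, it wins the responsibility contest exactly in the regions where every Gaussian density has decayed to nearly nothing, which is what makes this behave like an outlier catcher.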

It works well and is akin to finding outliers via DBSCAN clustering, except with less up-front tuning and investigation to set hyperparameters. But frankly, it's not much more magical than fitting a plain GMM without it and flagging, say, the points in the lowest 0.5% quantile of fitted likelihood (the quantile then becomes a hyperparameter). The only difference is that the algorithm itself chooses the outliers as a result of the fitting. Note, however, that the group membership results will change with the number of mixture components (itself a hyperparameter of a GMM), so you pay either Peter or Paul: nothing out there will tell you what an outlier is without making some kind of assumption or setting a hyperparameter up front.
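For comparison, the quantile approach in that last paragraph is easy to sketch with scikit-learn; the 0.5% cutoff and the number of components are assumed hyperparameters, and the data here is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(8, 1, (500, 2))])

# Fit a plain GMM; the number of components is a hyperparameter.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns the log-likelihood of each point under the mixture.
log_lik = gmm.score_samples(X)

# Flag the lowest 0.5% of points by likelihood as outliers.
threshold = np.quantile(log_lik, 0.005)
outliers = log_lik < threshold
print(f"flagged {outliers.sum()} of {len(X)} points")
```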