Solved – Anomaly detection based on clustering

Tags: anomaly detection, clustering, outliers

I understand that there are a lot of different methods for anomaly detection: based on classification, clustering, nearest neighbors, statistical approaches, etc.

I'm trying out a clustering-based approach. I cluster the data and, as a result, get some representatives (centroids, medoids), and each cluster has some kind of total (or, if you prefer, average) within-cluster distance.

These cluster representatives form a model. My question is: what do you do next, once you have a model? What would be a good anomaly metric?

I have some ideas, such as computing the distance from the object in question to the existing representatives and then comparing that distance with the average distance in the closest cluster. But I could also use some multiple of that distance. With k-means I basically have a variance for each cluster, so I could extend that to a standard deviation and use 3σ as a well-known cutoff for finding objects that are "not usual". But is that a good approach?
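Concretely, the k-means version of what I mean could look like this (a rough sketch with scikit-learn; the toy data, the per-cluster 3σ cutoff, and the scoring rule are just my illustration, not an established recipe):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # toy data; use your own

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# distance of every training point to its own centroid
dists = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)

# per-cluster mean and standard deviation of those distances
mu = np.array([dists[labels == c].mean() for c in range(km.n_clusters)])
sd = np.array([dists[labels == c].std() for c in range(km.n_clusters)])

# score a new object: distance to nearest centroid vs. that cluster's 3-sigma cutoff
x_new = np.array([[5.0, 5.0]])
c = km.predict(x_new)[0]
d = np.linalg.norm(x_new[0] - km.cluster_centers_[c])
print("anomaly" if d > mu[c] + 3 * sd[c] else "normal")
```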

I have described my reasoning for k-means, but what happens when you have a medoid? You no longer have a variance, but you do work with some similarity measure. What should you use then?
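For example, one distribution-free substitute I could imagine (though I'm not sure it's the right choice) is thresholding on an empirical quantile of the within-cluster dissimilarities. A sketch, assuming you already have the medoids and labels; `fit_thresholds` and `is_anomaly` are hypothetical helpers, and `cdist`'s Euclidean default stands in for whatever measure the medoids were built with:

```python
import numpy as np
from scipy.spatial.distance import cdist

def fit_thresholds(X, medoids, labels, q=0.95):
    """q-quantile of each member's dissimilarity to its own medoid,
    computed per cluster."""
    d = cdist(X, medoids)                        # plug in your own metric here
    own = d[np.arange(len(X)), labels]           # dissimilarity to own medoid
    return np.array([np.quantile(own[labels == c], q)
                     for c in range(len(medoids))])

def is_anomaly(x_new, medoids, thresholds):
    """Flag x_new if even its nearest medoid is beyond that cluster's cutoff."""
    d = cdist(x_new.reshape(1, -1), medoids)[0]
    c = d.argmin()
    return d[c] > thresholds[c]
```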

So, to make my question concrete: how do you detect anomalies after performing a clustering that produces representatives such as centroids or medoids?

Best Answer

ELKI includes a class called KMeansOutlierDetection (and many more).

But of all the methods that I have tried, this one worked worst:

[Figure: KMeansOutlierDetection scores on an easy, artificial data set]

Even on easy, artificial data it doesn't work too well, except for the trivial objects (that literally any method will detect).

The problem with cluster-based outlier detection is that you need a really good clustering result for it to work. On this data set, k-means does not work well (the colors in the figure above are not k-means clusters).

Here, k-means did not work well, and as a result you get false outliers along the bad cuts that k-means made:

[Figure: false outliers along the boundaries of a poor k-means partition]

Even worse, k-means is sensitive to outliers. So when you have lots of outliers, it tends to produce really bad results. You will want to first remove outliers, then run k-means; not the other way round!
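As a rough sketch of that order of operations (LOF here is just a stand-in for whatever outlier filter you prefer; the data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)),       # two toy blobs ...
               rng.normal(6, 1, (300, 2)),
               rng.uniform(-4, 10, (30, 2))])    # ... plus scattered noise

# step 1: remove outliers (LOF's fit_predict marks them with -1)
inlier = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1

# step 2: run k-means on the cleaned data only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[inlier])
```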

You will end up with lots of outliers at the borders between clusters. But if the clusters are not good, those borders may well run through the very middle of the data!
