Solved – Clustering based anomaly detection

anomaly detectionclusteringnovelty-detectionoutliers

I'm trying to implement anomaly detection based on clustering. I'm hopping for confirmation of my approach, and I'm exposing my idea, being aware that I could have miss something in my analysis, so any suggestions would also be very appreciated. I'm a beginner in this area, so any help would be great.

I must also add that the data is in the time series form, and that there is a number of parameters considered.

My approach is next:

Let's say that there are 3 parameters, a, b and c. Since parameters are in time series form, I'm using sliding window, lets assume that size of the window is 4.

My representation of point is

[a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4]

[a2, a3, a4, a5, b2, b3, b4, b5, c2, c3, c4, c5]

And so on.

Then I'm using clustering (K means clustering, where I use some kind of automatized elbow method to determine the number of clusters), which gives me centroids that represent what normal signals (normal values of parameters in a window) look like.

Based on that centroids and the distance of points to its nearest centroid (concretely certain number of points farthest from centroid) when a new measurement arrives, I assign it to nearest centroid and its distance to centroid, I determine weather a signal is anomalous.

Is my approach sound, if not, what should I consider to make it work.

I have tried it on data that I have synthesized and got "good" results, by testing on injected anomalies. But I'm still not sure in this approach, because of my lack of experience.

Any comment would be great.

Thanks

Best Answer

This has, of course, been tried before.

In fact, there is an implementation in ELKI, based on k-means.

The problem is, this isn't as easy as it sounds.

  • k-means itself is sensitive to outliers.
  • the points inbetween clusters aren't necessarily outliers, but may have a high score
  • the points away from all clusters can be more easily found by computing the distance from the data set center.

So in the end, it simply doesn't work.

DBSCAN is a clustering algorithm that has a concept of "noise". Which are primitive outliers - objects in regions of low density.

There are many more - more advanced - outlier detection methods available in ELKI; which worked much better for me than the k-means based thing.