Solved – Best clustering technique for outlier detection

clustering, k-means, outliers

I receive around 15-20 points every second, and I would like to detect outliers based on their density along the x-axis. That is, if I use k-means clustering, I would specify that a maximum variance of a is tolerated in the x-direction and a maximum of b in the y-direction, with a << b in this case.

Hence I get elliptical clusters.

Moreover, I think that density-based clustering algorithms would suit my problem; if you would recommend any other method, please suggest it.

In the following figure, I am most interested in the points near -10 on the x-axis, and I would like to retain them. The rest, the ones >-12, I want to discard. However, I need a general clustering method that does this for all the data, because the point of interest (-10 in this specific case) changes from season to season.

Thank you for your time. Please let me know if any further information is required or if you have any tips; they would be most welcome.

Plot 2
Plot 1

Best Answer

Why do you insist on using clustering?

To me it sounds as if a simple k-nearest-neighbor distance (or in fact, just the nearest-neighbor distance) should work for you. And that is as simple as it can get.

Note that it is trivial to add weights $\Omega=\{\omega_1,\ldots,\omega_d\}$ into Euclidean distance:

$$ \text{Euclidean}_\Omega(x,y) := \sqrt{\sum_{i=1}^{d} \omega_i(x_i-y_i)^2} $$
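A minimal sketch of this idea in plain Python, assuming that the outlier score of a point is its distance to its k-th nearest neighbor under the weighted Euclidean distance above. The function names, the sample points, and the weights (a large weight on x, so that small deviations along x count heavily) are illustrative assumptions, not values from the question.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def knn_distance_scores(points, weights, k=1):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean the point is more isolated (more outlier-like).
    Brute force, O(n^2), which is fine for a few dozen points per batch.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            weighted_euclidean(p, q, weights)
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: four points near x = -10 spread along y, one stray point.
points = [(-10.0, 1.0), (-10.1, 4.0), (-9.9, 7.0), (-10.05, 10.0), (-14.0, 5.0)]
weights = (100.0, 1.0)  # assumed: x deviations are penalized far more than y
scores = knn_distance_scores(points, weights, k=1)
most_isolated = scores.index(max(scores))
```

Points whose score exceeds some threshold would then be flagged as outliers; how to pick that threshold depends on the data and is not settled by this sketch.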

Related Solutions

Solved – Clustering based anomaly detection

This has, of course, been tried before.

In fact, there is an implementation in ELKI, based on k-means.

The problem is, this isn't as easy as it sounds.

- k-means itself is sensitive to outliers.
- The points in between clusters aren't necessarily outliers, but may have a high score.
- The points far away from all clusters can be found more easily by computing the distance from the data set center.
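The last point can be sketched as follows, assuming "data set center" means the coordinate-wise mean of all points and that plain Euclidean distance is used. The helper names and sample data are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of all points (the assumed 'data set center')."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def center_distance_scores(points):
    """Score each point by its Euclidean distance from the centroid."""
    c = centroid(points)
    return [math.dist(p, c) for p in points]

# Hypothetical data: a small group plus one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (8.0, 8.0)]
scores = center_distance_scores(points)
farthest = scores.index(max(scores))
```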

So in the end, it simply doesn't work.

DBSCAN is a clustering algorithm that has a concept of "noise": primitive outliers, that is, objects in regions of low density.
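To illustrate how noise arises, here is a minimal, unoptimized DBSCAN sketch in plain Python. It is not the ELKI implementation; the function names, the `eps` and `min_pts` values, and the sample points are assumptions chosen for the example, and a point's neighborhood is taken to include the point itself.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point has fewer than `min_pts` neighbors within `eps` and is not reachable from any core point, so it keeps the `NOISE` label.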

There are many more, and more advanced, outlier detection methods available in ELKI, which worked much better for me than the k-means-based approach.

Solved – Unsupervised outlier detection in 2D space

Your task seems to be a clustering task rather than an outlier detection task.

In the following, I use this popular data set of User locations (Joensuu).

Running OPTICS with the parameters

-dbc.in /tmp/MopsiLocations2012-Joensuu.txt
-algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.05
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 50

yields the following (hierarchical) clustering. You can see there are three larger clusters (corresponding to Joensuu, Lieska, and Savijärvi; note that the plot has latitude and longitude 'the wrong way'), and some noise (violet here) that is not density-reachable with 5km distance and 50 points. These are your outliers.

You can tell there are some subclusters in both cities, for example one corresponding to the Prisma Joensuu shopping mall. To see more detail, it is helpful to further reduce epsilon, maybe to just 500 meters.
