k-means++ algorithm and outliers

Tags: clustering, k-means

It is well known that the k-means algorithm suffers in the presence of outliers. k-means++ is an effective method for cluster-center initialization. I was going through the slides by the method's authors, David Arthur and Sergei Vassilvitskii, http://theory.stanford.edu/~sergei/slides/BATS-Means.pdf (slide 28), which show that the cluster-center initialization is not affected by an outlier.

As per the k-means++ method, the farthest points are more likely to be chosen as initial centers. By that logic, the outlier (the rightmost point) should also end up as an initial cluster centroid. What is the explanation for the figure?

Best Answer

In the k-means++ method, the first centroid (call it C1) can be any of the points: all points have an equal chance of being selected as the first centroid in the initialization stage. Once the first centroid is chosen, the second centroid is selected such that the probability of any point (other than C1, of course) being picked is proportional to its squared distance from the first centroid. The same logic extends to the remaining centroids: in general, the probability that a point is selected as a centroid is proportional to its squared distance from its nearest already-chosen centroid (the so-called D² weighting). In this way, the probabilistic selection in k-means++, as opposed to the uniform random selection in plain k-means, tends to spread the initial centroids as far apart as possible.
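The D² sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's implementation; the function name `kmeans_pp_init` is my own:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Pick k initial centroids from the rows of X using the
    D^2 weighting of k-means++ (Arthur & Vassilvitskii)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First centroid: uniform over all points.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        C = np.array(centers)
        # Squared distance from each point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        # Sample the next centroid with probability proportional to d^2.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

Note that because the sampling is probabilistic rather than "pick the farthest point", an outlier is *likely* but not *guaranteed* to be chosen, which is relevant to the question above.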

Now, coming to your question: k-means++ is still sensitive to outliers. One workaround is to remove outliers before clustering, using techniques such as LOF, RANSAC, or simple univariate box-plots. Another option is to reinitialize the centroids if you get sub-optimal performance on the first attempt.
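As an example of the simplest of those workarounds, here is a sketch of the univariate box-plot (IQR) rule applied per feature before clustering. The function name and the `factor=1.5` whisker multiplier are illustrative choices, not from the answer above:

```python
import numpy as np

def iqr_filter(X, factor=1.5):
    """Drop rows where any feature falls outside
    [Q1 - factor*IQR, Q3 + factor*IQR], the standard box-plot rule."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    # Keep rows whose every coordinate is inside the whiskers.
    mask = np.all((X >= lo) & (X <= hi), axis=1)
    return X[mask]
```

This is a crude univariate rule; LOF or RANSAC would be better choices when outliers are only visible in the joint distribution.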