Solved – Difference between the Hartigan & Wong algorithm and Lloyd's algorithm in K-means clustering

algorithms, clustering, k-means, machine learning, unsupervised learning

In each iteration of the Hartigan-Wong algorithm for K-means clustering, if the centroid was updated in the last step, then for each data point the within-cluster sum of squares it would contribute if it were included in another cluster is calculated.
If one of those sums of squares is smaller than the current one, the data point (case) is reassigned to that new cluster.
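For concreteness, here is a minimal sketch of that per-point test in Python (the cluster-size weights are the standard Hartigan-Wong update for the change in sum of squares; the function and variable names are my own):

```python
import numpy as np

def hw_sse_change(x, mu_from, n_from, mu_to, n_to):
    """Change in total within-cluster SSE if point x moves from its
    current cluster (mean mu_from, size n_from > 1) to another
    cluster (mean mu_to, size n_to). Negative means the move helps."""
    # Removing x shrinks the source cluster's SSE by this much ...
    removal = n_from / (n_from - 1) * np.sum((x - mu_from) ** 2)
    # ... while adding x grows the target cluster's SSE by this much.
    addition = n_to / (n_to + 1) * np.sum((x - mu_to) ** 2)
    return addition - removal
```

The point is reassigned when this change is negative for some other cluster.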

My question is: how is it different from Lloyd's algorithm?
In Lloyd's algorithm, a data point is assigned to a new cluster if its distance to that cluster's centroid is less than its distance to the current one. I am wondering: how is it possible for a data point to be farther from another cluster (the new cluster) yet have a smaller squared distance to that same cluster? The paper I referred to: https://core.ac.uk/download/pdf/27210461.pdf
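For contrast, a minimal sketch of Lloyd's assignment step (again just illustrative NumPy, not the reference implementation):

```python
import numpy as np

def lloyd_assign(X, centroids):
    """One Lloyd assignment step: each point goes to the centroid with
    the smallest squared Euclidean distance. Cluster sizes play no
    role in this test, unlike in the Hartigan-Wong sketch above."""
    # Pairwise squared distances, shape (n_points, k).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```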

I think it has something to do with the number of data points in a cluster, but I couldn't relate it to any example where a data point falls into different clusters under the two algorithms. Can someone please explain?

Best Answer

Lloyd's k-means also minimizes the sum of squares (it does not minimize Euclidean distances as such, though for a single point the nearest centroid by distance is also the nearest by squared distance, since squaring is monotone on non-negative values). So there is no difference in the objective.
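Both algorithms descend on the same objective, the total within-cluster sum of squares:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,$$

where $\mu_j$ is the mean of cluster $C_j$; they differ only in how they search for a local minimum of it.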

But the standard algorithm is pretty dumb: it recomputes all the distances in each iteration. Hartigan's also pays attention to a very important fact: if you add a point to a cluster, the mean will change. This shift increases the distances of the other points in that cluster to the mean. So reassigning a point from one cluster to another to slightly reduce that point's distance can degrade the fit of all the other points in the cluster, to the point where this was not a smart change to perform.
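To see how the two tests can disagree, here is a small example with invented numbers: the point is farther from the other centroid, so Lloyd's step leaves it where it is, yet moving it would still lower the total sum of squares, because its current cluster is tiny and losing the point shifts that mean a lot.

```python
# Illustrative numbers only: current cluster has 2 members,
# candidate cluster has 10.
d2_current, n_current = 1.0, 2    # squared distance to current mean
d2_other,   n_other   = 1.5, 10   # squared distance to other mean

# Lloyd's test compares plain squared distances -> the point stays.
lloyd_moves = d2_other < d2_current                    # False

# Hartigan-Wong compares size-weighted squared distances.
removal  = n_current / (n_current - 1) * d2_current    # 2.0
addition = n_other / (n_other + 1) * d2_other          # ~1.36
hw_moves = addition < removal                          # True: it moves

print(lloyd_moves, hw_moves)  # False True
```

So with a small source cluster, Hartigan-Wong can move a point that Lloyd's nearest-centroid rule would keep, which is exactly the cluster-size effect the question guesses at.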