Solved – What are the differences between Lloyd’s, MacQueen’s and Hartigan’s algorithms for K-Means

Tags: k-means, partitioning, r

There are three distinct algorithms for the K-Means function in R. These are:

  • Lloyd's
  • MacQueen's
  • Hartigan's

I believe I understand how Lloyd's works:

  1. The cluster centers are chosen.
  2. Points are assigned to the nearest center.
  3. Centers are shifted to the mean of the points assigned to them, and steps 2 and 3 repeat until the assignments stop changing.
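For reference, this is how I picture those three steps in code. It is only a minimal sketch in base R (random-row initialisation, Euclidean distance, empty clusters not handled), not the code that R's kmeans() actually runs:

    lloyd_kmeans <- function(x, k, iter.max = 100) {
      x <- as.matrix(x)
      # step 1: pick k random data points as the initial centers
      centers <- x[sample(nrow(x), k), , drop = FALSE]
      assignment <- rep(0L, nrow(x))
      for (it in seq_len(iter.max)) {
        # step 2: assign every point to its nearest center
        d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
        new_assignment <- max.col(-d)
        if (all(new_assignment == assignment)) break  # nothing moved: converged
        assignment <- new_assignment
        # step 3: move each center to the mean of the points assigned to it
        for (j in seq_len(k)) {
          centers[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
        }
      }
      list(cluster = assignment, centers = centers)
    }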

My question is about how MacQueen's and Hartigan's algorithms differ from it. Here are some more specific questions, but any information about these methods would be helpful:

  1. Supposedly MacQueen's is more efficient, because it updates centroids more often. (Source).

    At what stages does MacQueen's update centroids that Lloyd's algorithm does not? (the source is unclear to me)

    Why does this make it more efficient?

    What exactly does 'more efficient' even mean? (My tests show that MacQueen's, Lloyd's and Hartigan's all run at pretty much the same speed for any data.)

  2. My tests show that Hartigan's algorithm is more accurate because it does not seem to be affected as much by the random initial centroid positions. Here is the test I did:

There are 4 points. The vertical distance between them is 1, and the horizontal distance between them is 1000.

[figure: the four points, 1 apart vertically and 1000 apart horizontally]

Assuming there are 2 clusters, the correct clustering (the one with the lowest within-cluster sum of squares) clearly pairs the points that sit above and below each other.

[figure: the correct clustering, grouping vertically adjacent points]

But because of the random initial centroid positions, the algorithm sometimes converges to a local optimum and produces this obviously wrong result, pairing the horizontally adjacent points instead.

[figure: the incorrect clustering, grouping horizontally adjacent points]
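(For concreteness: if the points are taken to be at (0,0), (0,1), (1000,0) and (1000,1), the vertical pairing has a total within-cluster sum of squares of 4 × 0.5² = 1, while the horizontal pairing has 4 × 500² = 1,000,000.)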

After doing >1,000,000 runs on each of the three, I found that Lloyd's and MacQueen's were both correct 2/3 of the time, and wrong for the other 1/3. However, Hartigan's was correct 100% of the time.
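For anyone who wants to reproduce this, a sketch of such a test using R's kmeans() is below. The exact coordinates and the "optimal clustering has tot.withinss = 1" check are assumptions for illustration (only the 1-by-1000 spacing comes from the description above), and the run count is cut down from >1,000,000:

    # Four points, 1 apart vertically and 1000 apart horizontally (assumed coordinates)
    pts <- matrix(c(0, 0,
                    0, 1,
                    1000, 0,
                    1000, 1), ncol = 2, byrow = TRUE)

    wrong_rate <- function(algorithm, runs = 10000) {
      wrong <- 0
      for (i in seq_len(runs)) {
        fit <- kmeans(pts, centers = 2, algorithm = algorithm)
        # with these coordinates, the optimal (vertical) clustering has tot.withinss = 1
        if (fit$tot.withinss > 1 + 1e-8) wrong <- wrong + 1
      }
      wrong / runs
    }

    # fraction of runs that end in the wrong local optimum, per algorithm
    sapply(c("Lloyd", "MacQueen", "Hartigan-Wong"), wrong_rate)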

How does Hartigan's method even work? And how does this make it more accurate in terms of 'fixing' the wrong results that are caused by the initial random centroids?

Best Answer

Couldn't figure out much from the graphs, but to answer your questions:

At what stages does MacQueen's update centroids that Lloyd's algorithm does not? MacQueen's recalculates the centers at the end of each pass, like Lloyd's, but it also updates them every time a data point changes cluster. So there are many more centroid calculations in MacQueen's.
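In terms of what that extra update looks like, something along these lines (my own sketch of a running-mean update, not the code inside R's kmeans): when a point is reassigned from cluster a to cluster b, only those two centers are touched, and they are recomputed immediately rather than at the end of the pass.

    # Sketch of a MacQueen-style incremental centroid update (assumed form)
    macqueen_move <- function(centers, sizes, xi, a, b) {
      centers[a, ] <- (centers[a, ] * sizes[a] - xi) / (sizes[a] - 1)  # xi leaves cluster a
      sizes[a] <- sizes[a] - 1                                         # (assumes a does not empty out)
      centers[b, ] <- (centers[b, ] * sizes[b] + xi) / (sizes[b] + 1)  # xi joins cluster b
      sizes[b] <- sizes[b] + 1
      list(centers = centers, sizes = sizes)
    }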

Why does this make it more efficient? It is not always more efficient. Consider a scenario where the clusters are easily separable: MacQueen's moves slowly because it keeps recalculating centroids after every reassignment, while Lloyd's jumps towards the right centroids in each iteration with far fewer centroid calculations. So I think the efficiency gain is not guaranteed.

What exactly does 'more efficient' even mean? (My tests show that MacQueen's, Lloyd's and Hartigan's all run at pretty much the same speed for any data.) - I would read it as overall performance rather than raw speed. For example, there are cases where K-means ends up with an empty cluster, and MacQueen's and Hartigan's algorithms may give different results for the same dataset depending on the order of the points.

How does Hartigan's method even work? And how does this make it more accurate in terms of 'fixing' the wrong results that are caused by the initial random centroids? Please read the paper below (go to page 17). It uses the within-cluster sum of squares, rather than the distance to the nearest center, as the criterion for assigning a point to a cluster.
https://core.ac.uk/download/pdf/27210461.pdf
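From my reading, the reassignment test looks roughly like this (a sketch of the criterion, not code from the paper): a point is moved from its current cluster a to another cluster b whenever the move lowers the total within-cluster sum of squares, even if the point is still nearest to its current center.

    # Sketch of a Hartigan-style reassignment criterion (my reading of it):
    # moving point x from cluster a (size n_a, center c_a) to cluster b
    # (size n_b, center c_b) changes the total within-cluster sum of squares by
    #   delta = n_b/(n_b + 1) * ||x - c_b||^2  -  n_a/(n_a - 1) * ||x - c_a||^2
    # and the move is made whenever delta < 0.
    hartigan_delta <- function(x, c_a, n_a, c_b, n_b) {
      gain_from_leaving <- n_a / (n_a - 1) * sum((x - c_a)^2)
      cost_of_joining   <- n_b / (n_b + 1) * sum((x - c_b)^2)
      cost_of_joining - gain_from_leaving   # move x to b if this is negative
    }

That is also, I think, why it escapes the bad starting configuration in your test: a point can be pulled out of the 'wrong' cluster as long as the total sum of squares drops, whereas Lloyd's and MacQueen's only ever ask which center is currently nearest.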