Solved – Why is K-Means++ SLOWER than random initialization K-Means

clustering, k-means

K-Means is an iterative clustering method: it assigns initial centroids at random and then repeatedly shifts them to minimize the within-cluster sum of squares. One problem is that, because the centroids start out random, a bad starting position can cause the algorithm to converge to a local optimum.
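
To see the effect of random starts, here is a minimal sketch (the dataset x and the cluster count are made up purely for illustration): running kmeans several times with a single random start typically gives a spread of total within-cluster sums of squares, i.e. different local optima.

# Toy illustration (not the test data): two Gaussian blobs, deliberately
# over-clustered with k = 3 so that single random starts land in different
# local optima. The exact numbers will vary from run to run.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
replicate(5, kmeans(x, 3, nstart = 1, iter.max = 100, algorithm = "Lloyd")$tot.withinss)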

K-Means++ was designed to combat this: it chooses the initial centroids using a weighted sampling scheme, so points far from the centroids already chosen are more likely to be picked as the next centroid. The idea is that although initialization is more complex and takes longer, the starting centroids are better, so fewer iterations are needed and the overall running time goes down. (Source)
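
For concreteness, here is a minimal sketch of the D²-weighted seeding step (my own illustration, not the LICORS implementation): the first centre is drawn uniformly, and each later centre is drawn with probability proportional to a point's squared distance to its nearest already-chosen centre.

# Sketch of K-Means++ seeding (illustrative only, not the LICORS code).
kmeanspp_seed <- function(x, k) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), 1), , drop = FALSE]   # first centre: uniform at random
  while (nrow(centers) < k) {
    # squared distance from each point to its nearest chosen centre
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    # next centre: sampled with probability proportional to d2
    centers <- rbind(centers, x[sample(nrow(x), 1, prob = d2), , drop = FALSE])
  }
  centers
}

# The seeds can then be handed to the stock kmeans() as its centers argument:
# k <- kmeans(x, kmeanspp_seed(x, 2), iter.max = 100, algorithm = "Lloyd")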

In fact, the people who devised K-Means++ tested how fast it could cluster data, and found it to be twice as fast as standard K-Means. (Source)

However, some basic tests in R suggest that the reduction in iterations does not make up for the extra time spent on initialization, even for normally sized datasets.

The test:

I tested it with datasets ranging from 100 points to a few thousand. The data is named 'comp' in the code below; if you want to reproduce the test, any dataset will do.
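
(The original dataset isn't included here, so for anyone who wants to run the code as written, below is a purely hypothetical stand-in for 'comp'; any numeric dataset works just as well.)

# Hypothetical stand-in for 'comp' (not the original data): two Gaussian blobs.
set.seed(42)
comp <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
              matrix(rnorm(2000, mean = 5), ncol = 2))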

K-Means

First, we do the clustering:

k <- kmeans(comp, 2, nstart=1, iter.max=100, algorithm = "Lloyd")

Next, the within-cluster sums of squares can be stored in a results vector:

results <- k$withinss

Now, if we put the clustering in a loop that appends to 'results' on every pass, we can count how many times it loops in a fixed amount of time:

repeat {
  k <- kmeans(comp, 2, nstart=1, iter.max=10, algorithm = "Lloyd")
  results <- c(results, k$withinss)
}

(I did it in this inefficient way because I initially used this code to test accuracy, i.e. which method gave the better average total sum of squares.)

If we let the loop run for 60 seconds, the results vector ends up 132,482 elements long (the loop ran 66,241 times, since each pass adds two values, one within-cluster sum of squares per cluster).
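
(In case it helps, here is one way, not the code I actually ran, to bound the loop to 60 seconds automatically instead of interrupting it by hand:)

# Alternative: stop automatically after 60 seconds (not the original approach,
# which was simply interrupted by hand).
end_time <- Sys.time() + 60
results <- c()
while (Sys.time() < end_time) {
  k <- kmeans(comp, 2, nstart = 1, iter.max = 10, algorithm = "Lloyd")
  results <- c(results, k$withinss)
}
length(results) / 2   # number of completed passes through the loop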

K-Means++

Now compare that with ++ initialisation.

#Set-up
library(LICORS)
k <- kmeanspp(comp, k = 2, start = "random", iter.max = 100, nstart = 1)

results <- k$withinss

#Loop
repeat {
  k <- kmeanspp(comp, k = 2, start = "random", iter.max = 100, nstart = 1)
  results <- c(results, k$withinss)
}

The 'results' vector ended up with 22,712 elements (the loop ran 11,356 times).

K-Means was over 5 times faster than K-Means++, so this is clearly not just a measurement error. The ratio changes depending on the dataset I use for the test, but I've tried everything up to datasets with thousands of points, and the results consistently show that K-Means++ is slower.

My first thought was that maybe the package I used (LICORS) has inefficient code for performing K-Means, but then I saw that LICORS actually calls the default kmeans function after the ++ initialization. In other words: everything was the same except the method of initialization, and ++ was still slower. (Another package for K-Means++, flexclust, which uses different code, was even slower!)
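
(A rougher but more direct check, which I did not include in the timing above, would be to time a fixed number of calls of each function; the per-call gap should then be dominated by the R-level seeding:)

# Time 100 calls of each; this is a coarse check, not the original benchmark.
library(LICORS)
system.time(replicate(100, kmeans(comp, 2, nstart = 1, iter.max = 100, algorithm = "Lloyd")))
system.time(replicate(100, kmeanspp(comp, k = 2, start = "random", iter.max = 100, nstart = 1)))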

Perhaps the dataset needs tens of thousands of points for K-Means++ to be faster? If so, it would be very misleading for every source I've seen to say that K-Means++ is faster. Or perhaps I've misunderstood something, or there's something wrong with the test?

Can any experts here (such as Tim, who claims here that K-Means++ is faster) explain these results?

Best Answer

kmeans in R is backed by pretty good Fortran code.

I haven't looked at the package you got kmeanspp from, but I wouldn't expect it to be of the same quality. If it uses R code to do the initialization, that can hurt badly.
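
You can check this yourself by printing both function bodies from the console: stats::kmeans hands the work to compiled routines, and if kmeanspp is implemented in plain R, that will be visible directly.

# Spot check: compare the function bodies. kmeans calls compiled routines,
# while kmeanspp (if it is plain R) will show interpreted R code instead.
print(body(stats::kmeans))       # contains .Fortran()/.C() calls
print(body(LICORS::kmeanspp))    # interpreted R if the seeding is done in R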

So in the end, you haven't been benchmarking kmeans against kmeans++. What you have shown is that the quality of R packages varies a lot and that the R interpreter is substantially slower than compiled Fortran/C/C++ code, so you cannot rely on such benchmarks.

The low performance of flexclust is probably because it uses even more R code.