The problem with k-means applied to real-world, labeled data in particular is that the clusters will usually not agree with your labels very well, unless you either generated the labels with a similar clustering algorithm (a self-fulfilling prophecy) or the data set is really simple.
Have you tried computing the k-means statistics, such as the within-cluster sums of squares, on the original labeling of your data set? I would not be surprised at all if they are substantially worse than after running k-means.
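To make that concrete, here is a minimal sketch (the wss helper and the use of iris below are purely illustrative, not part of any library): compute the total within-group sum of squares under a given labeling and compare it to what kmeans reports.
wss <- function(x, labels) {
  # total within-group sum of squares for an arbitrary labeling:
  # center each group at its mean and add up the squared deviations
  sum(sapply(split(as.data.frame(x), labels),
             function(g) sum(scale(g, scale = FALSE)^2)))
}
wss(iris[, 1:4], iris$Species)                 # SS under the original labels
kmeans(iris[, 1:4], centers = 3)$tot.withinss  # SS after running k-means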
I figure it is just another case of an algorithm that does not fit your problem.
Evaluating clustering algorithms is really hard, because you actually want to find something you do not know yet. Even if a clustering reproduced the original labels exactly, it would in a sense have failed: it did not tell you anything new, and you could just have used the labels instead.
Maybe the most realistic evaluation of a clustering algorithm is the following: if you incorporate the result of the clustering into a classification algorithm, does it improve the classification accuracy significantly? That is, treat clustering as a preprocessing/support step for an algorithm that you can evaluate properly.
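A rough sketch of that protocol, where everything is chosen just for illustration (iris as a stand-in data set, nnet::multinom as an arbitrary classifier, and nearest() as a made-up helper):
library(nnet)  # for multinom(); nnet ships with R

set.seed(1)
train <- sample(nrow(iris), 100)
km <- kmeans(iris[train, 1:4], centers = 3)

# made-up helper: label each row with its nearest learned center,
# so that held-out rows get a cluster assignment too
nearest <- function(x, centers)
  apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
iris$cl <- factor(nearest(iris[, 1:4], km$centers))

base <- multinom(Species ~ . - cl, iris[train, ], trace = FALSE)  # without clusters
aug  <- multinom(Species ~ .,      iris[train, ], trace = FALSE)  # with clusters
mean(predict(base, iris[-train, ]) == iris$Species[-train])  # baseline accuracy
mean(predict(aug,  iris[-train, ]) == iris$Species[-train])  # augmented accuracy
On a toy data set the difference may well be zero; the point is the protocol: the clustering is judged by what it contributes downstream, not by how plausible its partition looks.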
You can compute the cluster assignments for a new data set with the following function:
clusters <- function(x, centers) {
  # compute squared Euclidean distance from each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)),
                function(i) apply(centers, 1,
                                  function(v) sum((x[i, ] - v)^2)))
  max.col(-t(tmp))  # find the index of the minimum distance
}
# create a simple data set with two clusters
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_new <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
               matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))
colnames(x_new) <- c("x", "y")
cl <- kmeans(x, centers=2)
all.equal(cl[["cluster"]], clusters(x, cl[["centers"]]))
# [1] TRUE
clusters(x_new, cl[["centers"]])
# [1] 2 2 2 2 2 1 1 1 1 1
plot(x, col=cl$cluster, pch=3)
points(x_new, col=clusters(x_new, cl[["centers"]]), pch=19)
points(cl[["centers"]], pch=4, cex=2, col="blue")
Or you could use the flexclust package, which implements a predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
#   1   2   3   4
# 130 181  98  91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results of cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
Best Answer
There is no "the k-means algorithm". K-means is the problem; algorithms for it include MacQueen, Lloyd/Forgy, Hartigan/Wong, and many more.
Most of these algorithms (all but exhaustive search, I guess) will only find a local optimum, not the global optimum. They are heuristics. Fast heuristics...
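For a quick illustration in R (a sketch only): stats::kmeans exposes several of these heuristics via its algorithm argument, and random restarts via nstart are the usual guard against bad local optima.
set.seed(42)
z <- matrix(rnorm(200), ncol = 2)
# different heuristics (and different starting points) may end in different local optima
for (alg in c("Hartigan-Wong", "Lloyd", "MacQueen"))
  print(kmeans(z, centers = 3, algorithm = alg)$tot.withinss)
kmeans(z, centers = 3, nstart = 25)$tot.withinss  # keep the best of 25 random starts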
It turns out the usual nearest-center assignment heuristic can even miss a local optimum when cluster sizes vary a lot. Not by much, but occasionally a point should not be assigned to its nearest center.
Finding the global optimum is, IIRC, NP-hard, so you do not want to use an exact algorithm unless you assume that P = NP.