The problem with k-means applied to real-world, labeled data in particular is that the clusters will usually not agree with your labels very well, unless you either generated the labels with a similar clustering algorithm (a self-fulfilling prophecy) or the data set is really simple.
Have you tried computing the k-means statistics, such as the within-cluster sums of squares, on the original labeling of your data set? I would not be surprised at all if they are substantially worse than after running k-means.
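To make that concrete, here is a minimal sketch (the wss helper and the use of iris below are purely illustrative, not part of any library): compute the total within-group sum of squares under a given labeling and compare it to what kmeans reports.
wss <- function(x, labels) {
  # total within-group sum of squares for an arbitrary labeling:
  # center each group at its mean and add up the squared deviations
  sum(sapply(split(as.data.frame(x), labels),
             function(g) sum(scale(g, scale = FALSE)^2)))
}
wss(iris[, 1:4], iris$Species)                 # SS under the original labels
kmeans(iris[, 1:4], centers = 3)$tot.withinss  # SS after running k-means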
I figure it is just another case of an algorithm that does not fit your problem.
Evaluating clustering algorithms is really hard, because you actually want to find something you do not know yet. Even if a clustering reproduced the original labels exactly, it would in a sense have failed: it did not tell you anything new, and you could just have used the labels instead.
Maybe the most realistic evaluation of a clustering algorithm is the following: if you incorporate the result of the clustering into a classification algorithm, does it improve the classification accuracy significantly? That is, treat clustering as a preprocessing/support step for an algorithm that you can evaluate properly.
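A rough sketch of that protocol, where everything is chosen just for illustration (iris as a stand-in data set, nnet::multinom as an arbitrary classifier, and nearest() as a made-up helper):
library(nnet)  # for multinom(); nnet ships with R

set.seed(1)
train <- sample(nrow(iris), 100)
km <- kmeans(iris[train, 1:4], centers = 3)

# made-up helper: label each row with its nearest learned center,
# so that held-out rows get a cluster assignment too
nearest <- function(x, centers)
  apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
iris$cl <- factor(nearest(iris[, 1:4], km$centers))

base <- multinom(Species ~ . - cl, iris[train, ], trace = FALSE)  # without clusters
aug  <- multinom(Species ~ .,      iris[train, ], trace = FALSE)  # with clusters
mean(predict(base, iris[-train, ]) == iris$Species[-train])  # baseline accuracy
mean(predict(aug,  iris[-train, ]) == iris$Species[-train])  # augmented accuracy
On a toy data set the difference may well be zero; the point is the protocol: the clustering is judged by what it contributes downstream, not by how plausible its partition looks.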
You can compute the cluster assignments for a new data set with the following function:
clusters <- function(x, centers) {
  # compute squared Euclidean distance from each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)),
                function(i) apply(centers, 1,
                                  function(v) sum((x[i, ] - v)^2)))
  max.col(-t(tmp))  # find the index of the minimum distance
}
# create a simple data set with two clusters
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_new <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
               matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))
colnames(x_new) <- c("x", "y")
cl <- kmeans(x, centers=2)
all.equal(cl[["cluster"]], clusters(x, cl[["centers"]]))
# [1] TRUE
clusters(x_new, cl[["centers"]])
# [1] 2 2 2 2 2 1 1 1 1 1
plot(x, col=cl$cluster, pch=3)
points(x_new, col=clusters(x_new, cl[["centers"]]), pch=19)
points(cl[["centers"]], pch=4, cex=2, col="blue")
Or you could use the flexclust package, which implements a predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
#   1   2   3   4
# 130 181  98  91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results of cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
Best Answer
There is no "the k-means algorithm". K-means is the problem; algorithms for it include MacQueen, Lloyd/Forgy, Hartigan/Wong, and many more.
Most of these algorithms (all but exhaustive search, I guess) will only find a local optimum, not the global optimum. They are heuristics. Fast heuristics...
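For a quick illustration in R (a sketch only): stats::kmeans exposes several of these heuristics via its algorithm argument, and random restarts via nstart are the usual guard against bad local optima.
set.seed(42)
z <- matrix(rnorm(200), ncol = 2)
# different heuristics (and different starting points) may end in different local optima
for (alg in c("Hartigan-Wong", "Lloyd", "MacQueen"))
  print(kmeans(z, centers = 3, algorithm = alg)$tot.withinss)
kmeans(z, centers = 3, nstart = 25)$tot.withinss  # keep the best of 25 random starts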
It turns out the usual nearest-center assignment heuristic can even miss a local optimum when cluster sizes vary a lot. Not by much, but occasionally a point should not be assigned to its nearest center.
Finding the global optimum is, IIRC, NP-hard, so you do not want to use an exact algorithm unless you assume that P = NP.