Solved – K-means initial centers membership

clusteringk-meansr

I'm trying to plot all the steps of a k-means algorithm with r, but I can't.

The k-means algorithm works in this way:

  • Step 1. Initialize the center of the clusters
  • Step 2. Assign the closest initial centers to each data point
  • Step 3. Set the position of each cluster to the mean of all data points belonging to that cluster
  • Step 4. Assign the closest cluster to each data point
  • Step 5. Repeat steps 3-4 until convergence

I plot the dataset and initial centers of clusters (Step 1). Too, I can plot the new cluster centers and show which point belongs to each cluster (Step 3 and 4). But I don't know how to plot the Step 2. I need the very first initial centers membership of each point, before the first iteration, but kmeans() doesn't give it you. How could I calculate this?

Here is my code:

set.seed(2009)
points1 <- data.frame(x=rnorm( 50,1,0.1), y=rnorm(50,5,0.1))
points2 <- data.frame(x=rnorm( 50,5,0.1),  y=rnorm(50,5,0.1))
points3 <- data.frame(x=rnorm(200,3,0.8),y=rnorm(200,3,0.8))
df <- rbind(points1, points2, points3)

p <- ggplot(df, aes(x, y))
p + geom_point(size=7, color="grey") + labs(title="Initial configuration")

y <- c(4.88871745,4.88099143,3.69713723)
x <- c(0.75606015,1.26736958,3.04961545)
kcenters <- data.frame(x,y)

p + geom_point(size=7, color="grey") + 
    geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x") + 
    labs(title="Initial centers")

dfCluster <- kmeans(df, centers=kcenters, iter.max=1)

p + geom_point(size=7, aes(colour=as.factor(dfCluster$cluster))) + 
        geom_point(data=data.frame(dfCluster$center), aes(x, y), size=7, 
               color="black", shape="x") + 
    theme(legend.position="none") + 
    labs(title="First iteration")

My goal would be to show the initial center membership of each point in "Initial centers" plot.


Edit:

I think I did not explain myself properly.

On this website there is a simulation showing what I would like to get:

http://www.onmyphd.com/?p=k-means.clustering

When you click the "Iteration" button the first time (click1), the initial centers are placed. Pressing a second time (click 2), points are assigned to closer center, and painted with different colors. When you click the third time (click3), new centers are calculated, and when you press for the fourth time (click4), points are assigned to closer center again.

When you run kmeans() and stop it at the first iteration, you get the new centers of the clusters ( click3 ), dfCluster$center, and cluster membership of each point (click4), dfCluster$cluster, but you do not get the initial center membership of each point (click 2), which is exactly what I'm looking for.


I finally accomplished what I wanted: a step-by-step k-means. Sorry if the code it's not perfect, I'm a newbie with R.

#How does k-means work

library(ggplot2)

set.seed(2009)
points1<-data.frame(x=rnorm(50,1,0.1),y=rnorm(50,5,0.1))
points2<-data.frame(x=rnorm(50,5,0.1),y=rnorm(50,5,0.1))
points3<-data.frame(x=rnorm(200,3,0.8),y=rnorm(200,3,0.8))
df<-rbind(points1,points2,points3)

#plot initial points
p <- ggplot(df, aes(x, y))
p + geom_point(size=7, color="grey")

#set initial centers
kcenters<-df[c(49,26,297),]

#plot centers
p + geom_point(size=7, color="grey") + geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x")

#assignment (to calculate distances to initial centers and to allocate points to the cluster to which they are closest)
library(reshape)
distances <- melt(as.matrix(dist(df,diag=T,upper = T)), varnames = c("row", "col"))
dist_center1<-subset(distances,col==49,select = value)
dist_center2<-subset(distances,col==26,select = value)
dist_center3<-subset(distances,col==297,select = value)
dist_centers<-data.frame(dist_center1,dist_center2,dist_center3)
colnames(dist_centers)<-c("dist_center1","dist_center2","dist_center3")
dist_centers$cluster<-apply(dist_centers, 1, which.min)
df<-cbind(df,dist_centers)

#plot assignment
p + geom_point(size=7, aes(colour=as.factor(df$cluster))) + geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x") + theme(legend.position="none")

#calculate new centers
x<-tapply(df$x,df$cluster,mean)
y<-tapply(df$y,df$cluster,mean)
kcenters<-data.frame(x,y)

#plot new centers
p + geom_point(size=7, aes(colour=as.factor(df$cluster))) + geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x") + theme(legend.position="none")

And then, you can continue the procedure slightly adjusting the code above:

#assignment
df<-rbind(df[,1:2],kcenters)
row.names(df) <- NULL
distances <- melt(as.matrix(dist(df,diag=T,upper = T)), varnames = c("row", "col"))
dist_center1<-subset(distances,col==301,select = value)
dist_center2<-subset(distances,col==302,select = value)
dist_center3<-subset(distances,col==303,select = value)
dist_centers<-data.frame(dist_center1,dist_center2,dist_center3)
colnames(dist_centers)<-c("dist_center1","dist_center2","dist_center3")
dist_centers$cluster<-apply(dist_centers, 1, which.min)
df<-cbind(df[1:300,],dist_centers[1:300,])

#plot assignment
p + geom_point(size=7, aes(colour=as.factor(df$cluster))) + geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x") + theme(legend.position="none")

#calculate new centers
x<-tapply(df$x,df$cluster,mean)
y<-tapply(df$y,df$cluster,mean)
kcenters<-data.frame(x,y)

#plot new centers
p + geom_point(size=7, aes(colour=as.factor(df$cluster))) + geom_point(data=kcenters, aes(x, y), size=7, color="black", shape="x") + theme(legend.position="none")

If you run kmeans() with the same initial centers and stop it on first iteration, dfCluster<-kmeans(df,centers=kcenters, iter.max = 1), you get the follow centers:

> dfCluster$centers
         x        y
1 1.129419 4.905327
2 2.928011 2.880839
3 4.715513 4.766608

These centers do not match with the ones I get in the first iteration of my procedure (#calculate new centers). I have to run it for 14 times (#assigment and #calculate new centers) to obtain them. I don't know the meaning of "iteration" in the kmeans() procedure. Does anybody know?

Best Answer

In kmeans help you can read that there is centers argument that is

either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

So knowing that kmeans starts with allocating cluster centers randomly you can (1) choose by hand some random centers, (2) start the algorithm with iter.max=1 (single iteration), save results and (3) start it again with the previous output as centers. Next you repeat (2) and (3) it until convergence and you have your data.

Generally there is no point in recording the initial values since they are random, so they are not recorded.