R K-means – Reading K-means Data and Creating Visualizations in R

data visualization, k-means, r

I performed and plotted a kmeans analysis in R with the following commands:

 library(fpc)                          # provides plotcluster()
 km = kmeans(t(mat2), centers = 4)
 plotcluster(t(mat2), km$cluster)

Here is the result from the plot (image not included).

What I want to know is how to make sense of the km$centers attribute and the plot. What I know is that km$centers is a 4 X 31 matrix. Each row represents the corresponding cluster. I think each column represents an iteration in the algorithm (correct me if I am wrong) so the final iteration and result of the algorithm for the centers would be given by:

km$centers[, 31]

 0.008785652 -0.088641371 -0.012666252 -0.079348292 

I must be wrong about a lot because this leads to the following questions:

  1. The centers given by km$centers are not (x, y) coordinates. How do I get these (x, y) center coordinates?
  2. The center for cluster 4 (according to the plot) must be something like (12, 2) but the above center numbers do not reflect any of these coordinates. In fact every number in the 4 X 31 matrix is less than 1. So, what is the relationship between km$centers and the plot?

The ultimate goal here is to create a matching (not mentioned here) based upon the (x, y) coordinates.

All help is greatly appreciated!

Best Answer

I think you're getting hung up on the difference between the center of the actual cluster vs. the center of the 1s, 2s, etc. on your plot.

The actual center of your cluster is in a high-dimensional space, where the number of dimensions is determined by the number of attributes you're using for clustering. For example, if your data has 100 rows and 8 columns, then kmeans interprets that as having 100 examples to cluster, each of which has eight attributes. Suppose you call:

km = kmeans(myData, 4)

Then, km$centers will be a matrix with four rows and eight columns. The columns are attributes, not iterations of the algorithm. The center of cluster #1 is in km$centers[1, ]: the eight values there give its position in the 8-D space. Cluster #2's center is in km$centers[2, ] and so on. If you had eighty attributes instead, then each center (e.g., km$centers[1, ], km$centers[2, ]) would be eighty values long and correspond to a point in eighty-dimensional space instead.
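To make the shape of km$centers concrete, here is a minimal sketch on made-up data. The matrix myData below is random noise that I generate purely for illustration (it is not your mat2), and the seed is arbitrary.

```r
set.seed(42)  # kmeans picks random starting centers, so fix the seed

# Hypothetical toy data: 100 examples (rows), 8 attributes (columns)
myData <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)

km <- kmeans(myData, centers = 4)

dim(km$centers)    # 4 rows (one per cluster) by 8 columns (one per attribute)
km$centers[1, ]    # position of cluster 1's center in the 8-D attribute space

# Each center is just the column-wise mean of the points assigned to that cluster
members1 <- myData[km$cluster == 1, , drop = FALSE]
centerMatchesMean <- isTRUE(all.equal(unname(km$centers[1, ]),
                                      unname(colMeans(members1))))
centerMatchesMean
```

In your case t(mat2) evidently has 31 columns, so each row of km$centers is a point in 31-dimensional space, and km$centers[, 31] is only the 31st coordinate of each of the four centers.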

This is nice, because preserving the space allows us to interpret the clusters (e.g., these people are very wealthy, have high blood pressure, etc) and lets us assign new examples to the existing clusters. However, it's tricky to actually visualize something with $>3$ dimensions, so plotcluster projects down to a more tractable two dimensions, which can easily be plotted.
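As a rough sketch of what "assigning new examples to the existing clusters" can look like in the original space: compute the distance from the new example to every row of km$centers and take the closest. The helper nearest_cluster, the toy matrix myData, and newPoint are all names I made up for this illustration; kmeans itself does not provide a predict method in base R.

```r
set.seed(1)

# Hypothetical toy data again: 100 examples with 8 attributes each
myData <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)
km <- kmeans(myData, centers = 4)

# Hypothetical helper: index of the center closest to x (Euclidean distance)
nearest_cluster <- function(x, centers) {
  d2 <- rowSums(sweep(centers, 2, x)^2)  # squared distance from x to each center
  unname(which.min(d2))
}

newPoint <- rnorm(8)                     # a new example with the same 8 attributes
assigned <- nearest_cluster(newPoint, km$centers)
assigned                                 # an integer between 1 and 4
```

The same idea should carry over to matching: distances are computed on the full attribute vectors, not on the two plotted coordinates.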

My guess is that for matching purposes, you should go with the original centers, rather than the ones given by plotcluster. However, if you really want those, it looks like plotcluster calls discrproj internally, so you could do that yourself.
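If you do want the plotted (x, y) positions, something along these lines should work. It is a sketch under a few assumptions: the fpc package is installed, plotcluster is using its default method ("dc", discriminant coordinates), and the first two columns of the proj component returned by discrproj are what ends up on the axes. I use the built-in iris data as a stand-in, since I don't have your mat2.

```r
library(fpc)  # assumes fpc is installed: install.packages("fpc")

set.seed(1)
x <- as.matrix(iris[, 1:4])   # stand-in data, not the asker's mat2
km <- kmeans(x, centers = 3)

# Run the projection directly instead of going through plotcluster
dp <- discrproj(x, km$cluster, method = "dc")
xy <- dp$proj[, 1:2]          # projected (x, y) position of every data point

# The projection is linear, so the mean of the projected points in a cluster
# is the projected cluster center
counts <- as.vector(table(km$cluster))
centers_xy <- rowsum(xy, km$cluster) / counts
centers_xy                    # one row per cluster, two columns (x, y)
```

With your data you would pass t(mat2) and km$cluster instead, and centers_xy would then have four rows.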
