R K-means – Reading K-means Data and Creating Visualizations in R

data visualization, k-means, r

I performed and plotted a kmeans analysis in R with the following commands:

 km = kmeans(t(mat2), centers = 4)
 plotcluster(t(mat2), km$cluster)      #from library(fpc)

Here is the resulting plot: [image]

What I want to know is how to make sense of the km$centers attribute and the plot. What I know is that km$centers is a 4 X 31 matrix. Each row represents the corresponding cluster. I think each column represents an iteration in the algorithm (correct me if I am wrong) so the final iteration and result of the algorithm for the centers would be given by:

km$centers[, 31]

 0.008785652 -0.088641371 -0.012666252 -0.079348292

I must be wrong about a lot because this leads to the following questions:

The centers given by km$centers are not (x, y) coordinates. How do I get these (x, y) center coordinates?
The center for cluster 4 (according to the plot) must be something like (12, 2) but the above center numbers do not reflect any of these coordinates. In fact every number in the 4 X 31 matrix is less than 1. So, what is the relationship between km$centers and the plot?

The ultimate goal here is to create a matching (not mentioned here) based upon the (x, y) coordinates.

All help is greatly appreciated!

Best Answer

I think you're getting hung up on the difference between the center of the actual cluster vs. the center of the 1s, 2s, etc. on your plot.

The actual center of your cluster is in a high-dimensional space, where the number of dimensions is determined by the number of attributes you're using for clustering. For example, if your data has 100 rows and 8 columns, then kmeans interprets that as having 100 examples to cluster, each of which has eight attributes. Suppose you call:

km = kmeans(myData, 4)

Then, km$centers will be a matrix with four rows and eight columns. The center of cluster #1 is in km$centers[1, ]: the eight values there give its position in the 8-D space. Cluster #2's center is in km$centers[2, ] and so on. If you had eighty attributes instead, then each center (e.g., km$centers[1, ], km$centers[2, ]) would be eighty values long and correspond to a point in eighty-dimensional space instead. So in your 4 X 31 matrix, the 31 columns are the 31 attributes of t(mat2), not iterations of the algorithm.

This is nice, because preserving the space allows us to interpret the clusters (e.g., these people are very wealthy, have high blood pressure, etc.) and lets us assign new examples to the existing clusters. However, it's tricky to visualize anything with more than three dimensions, so plotcluster projects the data down to a more tractable two dimensions, which can easily be plotted.
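To make the shape of km$centers concrete, here is a minimal sketch in base R. The data are simulated (myData and new_example are made-up placeholders, not your mat2), and the nearest-center rule at the end is just one plain way to assign a new example to an existing cluster.

```r
set.seed(42)

# Hypothetical data: 100 examples (rows), each with 8 attributes (columns)
myData <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)
km <- kmeans(myData, centers = 4)

dim(km$centers)    # one row per cluster, one column per attribute
km$centers[1, ]    # position of cluster 1's center in the 8-D space

# Assign a new (hypothetical) example to the closest existing center,
# using squared Euclidean distance in the original 8-D space
new_example <- rnorm(8)
d2 <- rowSums(sweep(km$centers, 2, new_example)^2)
nearest <- which.min(d2)
nearest
```

Because the distances are computed in the original attribute space, this kind of assignment does not depend on whatever 2-D projection was used for plotting.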

My guess is that for matching purposes, you should go with the original centers, rather than the ones given by plotcluster. However, if you really want those, it looks like plotcluster calls discrproj internally, so you could do that yourself.

Links:

FPC Package Documentation, where I read about plotcluster and discrproj
K Means Documentation (R)

Related Solutions

Solved – R getting 2D coordinates from kmeans

First, let's generate some example data and cluster it:

library(fpc)   # provides rFace, discrproj and plotcluster
data <- rFace(1000)
km <- kmeans(data, 6)

Now, we can use discrproj to find an appropriate projection that separates these clusters. Note that the cluster assignments returned by kmeans are stored in km$cluster:

dp = discrproj(data, km$cluster)

The result, dp, has several fields that are potentially useful. The field dp$proj contains the coordinates of the original data points, projected onto our new space. This space has the same dimensionality as the original space, but the first two dimensions separate the clusters best (which is what plotcluster actually displays).

Compare:

plot(dp$proj[,1], dp$proj[,2], pch=km$cluster+48, col=km$cluster) # +48 maps cluster numbers 1..9 to the character codes for "1".."9"

with:

plotcluster(data, km$cluster)

Suppose you get some new points in your original space. You can project them into your new space using the basis vectors in dp$units, like this:

newpts = newdata %*% dp$units[,1:2]

That should answer your first question. Unfortunately, I think the second part is effectively unanswerable, because there are infinitely many points in the 31-dimensional space that correspond to a given point in the 2D space.
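Putting the pieces above together, here is a hedged sketch of one way to get an (x, y) position for each cluster in the plotted space: take the per-cluster mean of the first two columns of dp$proj. It assumes the fpc package is installed; the sample size and seed are arbitrary choices, and centers_2d is just an illustrative name.

```r
library(fpc)   # assumed installed; provides rFace and discrproj

set.seed(1)
data <- rFace(300)                 # example data, as in the answer above
km <- kmeans(data, centers = 6)

dp <- discrproj(data, km$cluster)  # projection that separates the clusters
xy <- dp$proj[, 1:2]               # the two dimensions that get plotted

# Mean of the projected points in each cluster: one (x, y) row per cluster
centers_2d <- apply(xy, 2, function(v) tapply(v, km$cluster, mean))
centers_2d

# If the projection is a plain linear map (as the newpts line above suggests),
# projecting the original centers should land on the same points:
centers_2d_alt <- km$centers %*% dp$units[, 1:2]

# Overlay the 2-D centers on the projected data
plot(xy, col = km$cluster, pch = km$cluster + 48)
points(centers_2d, pch = 4, cex = 2, lwd = 2)
```

These 2-D centers are only meaningful relative to this particular projection. For matching, the centers in the original space are likely the safer choice.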

Solved – K-means initial centers membership

The kmeans help page describes the centers argument as

either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

So, knowing that kmeans starts by choosing the cluster centers at random, you can (1) choose some random centers by hand, (2) run the algorithm with iter.max=1 (a single iteration) and save the results, and (3) run it again with the previous output as centers. Then repeat (2) and (3) until convergence, and you have the centers at every step.

Generally there is no point in recording the initial values since they are random, so they are not recorded.
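Steps (1) to (3) can be sketched as follows. This is only an illustration under a few assumptions: the toy data x are made up, algorithm = "Lloyd" is my choice so that a single iteration is one assign-and-update pass, and the cap of 50 steps is arbitrary.

```r
set.seed(7)

# Hypothetical toy data: three 2-D blobs of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# (1) choose the initial centers by hand: three distinct rows of x
centers <- x[sample(nrow(x), 3), ]
history <- list(centers)           # keep the initial centers too

for (step in 1:50) {               # arbitrary cap on the number of steps
  # (2) run a single iteration; the "did not converge" warning is expected
  fit <- suppressWarnings(
    kmeans(x, centers = centers, iter.max = 1, algorithm = "Lloyd")
  )
  history[[length(history) + 1]] <- fit$centers

  # stop once the centers no longer move
  if (isTRUE(all.equal(unname(fit$centers), unname(centers)))) break

  # (3) feed the output back in as the next starting centers
  centers <- fit$centers
}

length(history)   # initial centers plus one entry per iteration run
fit$cluster       # membership after the last iteration
```

Each element of history is a 3 X 2 matrix of centers, so the path of every center can be inspected or plotted afterwards.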
