I think you're getting hung up on the difference between the center of the actual cluster vs. the center of the 1s, 2s, etc. on your plot.
The actual center of your cluster is in a high-dimensional space, where the number of dimensions is determined by the number of attributes you're using for clustering. For example, if your data has 100 rows and 8 columns, then `kmeans` interprets that as having 100 examples to cluster, each of which has eight attributes. Suppose you call `km <- kmeans(myData, 4)`. Then `km$centers` will be a matrix with four rows and eight columns. The center of cluster #1 is in `km$centers[1, ]`; the eight values there give its position in the 8-D space. Cluster #2's center is in `km$centers[2, ]`, and so on. If you had eighty attributes instead, then each center (e.g., `km$centers[1, ]`, `km$centers[2, ]`) would be eighty values long and correspond to a point in eighty-dimensional space instead.
This is nice, because preserving the space allows us to interpret the clusters (e.g., these people are very wealthy, have high blood pressure, etc.) and lets us assign new examples to the existing clusters. However, it's tricky to actually visualize something with $>3$ dimensions, so `plotcluster` projects down to a more tractable two dimensions, which can easily be plotted.
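For example (this assumes the `fpc` package and reuses the `km` fit from above):

```r
# plotcluster() is from the fpc package; it projects the data to 2-D
# and labels each point with its cluster number
library(fpc)
plotcluster(myData, km$cluster)
```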
My guess is that for matching purposes, you should go with the original centers rather than the ones given by `plotcluster`. However, if you really want those, it looks like `plotcluster` calls `discrproj` internally, so you could do that yourself.
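Here's a sketch of that, assuming `discrproj`'s default projection and the `myData`/`km` objects from above. Rather than depending on `discrproj`'s internals, it just averages each cluster's projected points, which is where the 1s, 2s, etc. sit on the plot:

```r
library(fpc)

# Project the data to 2-D the same way plotcluster() does by default
dp <- discrproj(myData, km$cluster)

# dp$proj holds each example's projected coordinates; the plot uses the
# first two columns. Averaging them per cluster gives the centers of
# the 1s, 2s, etc. as they appear on the plot.
proj_xy <- as.data.frame(dp$proj[, 1:2])
proj_centers <- aggregate(proj_xy, by = list(cluster = km$cluster), FUN = mean)
proj_centers
```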
Best Answer
Removing correlations is a best practice (whitening), but it is not required.
Non-continuous variables, however, tend to yield bad results with k-means, even after whitening. Because of the clear-cut gaps in non-continuous data, those gaps tend to dominate the k-means clustering result much more than any structure in the continuous attributes.
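A minimal whitening sketch in R, reusing the hypothetical `myData` from above (the ZCA-style eigendecomposition here is just one common way to do it):

```r
# Center the data, then rotate and rescale so the sample covariance
# becomes the identity matrix (whitening)
X <- scale(myData, center = TRUE, scale = FALSE)
e <- eigen(cov(X))
# ZCA whitening matrix; near-zero eigenvalues would need
# regularization on real data
W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
Xw <- X %*% W

round(cov(Xw), 10)            # approximately the identity matrix
km_w <- kmeans(Xw, centers = 4)
```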