I think you're getting hung up on the difference between the center of the actual cluster vs. the center of the 1s, 2s, etc. on your plot.
The actual center of your cluster is in a high-dimensional space, where the number of dimensions is determined by the number of attributes you're using for clustering. For example, if your data has 100 rows and 8 columns, then `kmeans` interprets that as having 100 examples to cluster, each of which has eight attributes. Suppose you call `km <- kmeans(myData, 4)`. Then `km$centers` will be a matrix with four rows and eight columns. The center of cluster #1 is in `km$centers[1, ]`; the eight values there give its position in the 8-D space. Cluster #2's center is in `km$centers[2, ]`, and so on. If you had eighty attributes instead, then each center (e.g., `km$centers[1, ]`, `km$centers[2, ]`) would be eighty values long and correspond to a point in eighty-dimensional space instead.
This is nice, because preserving the space allows us to interpret the clusters (e.g., these people are very wealthy, have high blood pressure, etc.) and lets us assign new examples to the existing clusters. However, it's tricky to actually visualize something with $>3$ dimensions, so `plotcluster` projects down to a more tractable two dimensions, which can easily be plotted.
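For example (this assumes the `fpc` package and reuses the `km` fit from above):

```r
# plotcluster() is from the fpc package; it projects the data to 2-D
# and labels each point with its cluster number
library(fpc)
plotcluster(myData, km$cluster)
```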
My guess is that for matching purposes, you should go with the original centers rather than the ones given by `plotcluster`. However, if you really want those, it looks like `plotcluster` calls `discrproj` internally, so you could do that yourself.
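Here's a sketch of that, assuming `discrproj`'s default projection and the `myData`/`km` objects from above. Rather than depending on `discrproj`'s internals, it just averages each cluster's projected points, which is where the 1s, 2s, etc. sit on the plot:

```r
library(fpc)

# Project the data to 2-D the same way plotcluster() does by default
dp <- discrproj(myData, km$cluster)

# dp$proj holds each example's projected coordinates; the plot uses the
# first two columns. Averaging them per cluster gives the centers of
# the 1s, 2s, etc. as they appear on the plot.
proj_xy <- as.data.frame(dp$proj[, 1:2])
proj_centers <- aggregate(proj_xy, by = list(cluster = km$cluster), FUN = mean)
proj_centers
```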
Best Answer
Removing correlations is a best practice (whitening), but it is not required.
Non-continuous variables, however, tend to yield bad results with k-means, even after whitening. Because of the clear-cut gaps in non-continuous data, those gaps tend to dominate the k-means clustering result much more than any structure in the continuous attributes.
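A minimal whitening sketch in R, reusing the hypothetical `myData` from above (the ZCA-style eigendecomposition here is just one common way to do it):

```r
# Center the data, then rotate and rescale so the sample covariance
# becomes the identity matrix (whitening)
X <- scale(myData, center = TRUE, scale = FALSE)
e <- eigen(cov(X))
# ZCA whitening matrix; near-zero eigenvalues would need
# regularization on real data
W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
Xw <- X %*% W

round(cov(Xw), 10)            # approximately the identity matrix
km_w <- kmeans(Xw, centers = 4)
```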