Solved – Correlated variables in kmeans clustering

k-means

I have a dataset with 3 variables: A, B and C. A and B are ordinal variables (i.e., the answers to two questions measured on a 5-point Likert scale), whereas C is continuous.

A and B are also correlated (Spearman rho = .50, p-value = 0.0046).

I want to partition my dataset into 3 clusters using kmeans (the default R implementation). Does the fact that some of the variables in my dataset are correlated violate any assumptions of the algorithm?

Best Answer

Removing correlations (whitening) is best practice, but it is not required.
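
To make "whitening" concrete, here is a minimal sketch in plain Python (not the R code from the question). It uses Cholesky whitening, which is only one of several whitening variants (PCA and ZCA whitening are others). The helper names (`column_means`, `covariance`, `cholesky`, `whiten`) and the synthetic data are assumptions made up for illustration; they do not come from the question.

```python
import math
import random


def column_means(rows):
    """Mean of each column for a list of observation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (n - 1 denominator)."""
    n = len(rows)
    mu = column_means(rows)
    d = len(mu)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def whiten(rows):
    """Cholesky whitening: solve L z = (x - mean) for every row x."""
    mu = column_means(rows)
    L = cholesky(covariance(rows))
    out = []
    for row in rows:
        c = [x - m for x, m in zip(row, mu)]
        z = []
        for i in range(len(c)):  # forward substitution
            z.append((c[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
        out.append(z)
    return out


# Synthetic, made-up data: b is constructed to correlate with a.
random.seed(0)
data = []
for _ in range(200):
    a = random.gauss(0, 1)
    b = 0.5 * a + random.gauss(0, 1)
    c = random.gauss(0, 2)
    data.append([a, b, c])

white = whiten(data)
cov_white = covariance(white)  # identity matrix up to rounding error
```

After this transform the variables are uncorrelated and have unit variance, so no single variable or correlated pair carries extra weight in the Euclidean distances that k-means uses.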

Non-continuous variables, however, tend to yield bad results with k-means, even after whitening. Non-continuous data has clear-cut gaps between its possible values, and these gaps tend to dominate the k-means clustering result much more than any structure in the continuous attributes.
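
A toy illustration of the gap effect, again as a hedged sketch in plain Python. It does not run k-means. It only compares the within-cluster sum of squares (the quantity k-means minimizes) for two candidate 2-cluster partitions. The data are made up: a two-level variable stands in for the non-continuous case (simpler than a 5-point Likert item), and the continuous variable is evenly spread with no cluster structure. The helper names `standardize` and `wcss` are hypothetical.

```python
def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def wcss(points, labels):
    """Within-cluster sum of squares for a given partition."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - c) ** 2 for p in members for a, c in zip(p, centroid))
    return total


n = 400
binary = [i % 2 for i in range(n)]                        # two levels, one gap
continuous = [(i // 2) / (n // 2 - 1) for i in range(n)]  # evenly spread on [0, 1]

# Both variables get unit variance, so neither is favoured by scale alone.
points = list(zip(standardize(binary), standardize(continuous)))

split_on_gap = binary                                  # cut between the two levels
split_on_median = [int(y > 0.5) for y in continuous]   # cut the continuous variable

wcss_gap = wcss(points, split_on_gap)
wcss_median = wcss(points, split_on_median)
print(wcss_gap, wcss_median)  # the gap split has the lower value
```

Both variables have the same variance after standardizing, yet cutting at the gap removes all of the two-level variable's variance, while cutting the continuous variable at its median removes only part of its variance. On this toy data the k-means objective therefore prefers the partition defined by the gap.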
