Solved – Kmeans: Whether to standardise? Can you use categorical variables? Is Cluster 3.0 suitable

clustering, k-means

I am running kmeans for a market research study, and I have a couple of questions:

  1. Should I be standardizing my data, and if so, how? For example, one variable I have is product demand, which is measured on a seven point scale. On the other hand, I also have a variable on age, which is a very different scale. Should I be standardizing these, and how?

  2. Can I use categorical variables in kmeans? Specifically, I would like to use gender and ethnicity. If it is possible, how would I prepare this data for the cluster analysis? I suppose I would assign numbers to them, but how would I standardize these with my other data?

  3. I downloaded the open source software Cluster 3.0. Is this a good one to use?

Best Answer

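First of all: yes, standardization is a must unless you have a strong argument for why it is not necessary. Without it, the variable with the largest numeric range (age, in your case) dominates the Euclidean distances that k-means relies on. Z-scores are probably the first thing to try.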
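As a rough illustration of z-scoring, here is a minimal sketch using only the Python standard library. The column values are made up for the example, and `z_scores` is a hypothetical helper, not something from a particular clustering package.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / std for each value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical survey columns: demand on a 1-7 scale, age in years.
demand = [1, 4, 7, 5, 3]
age = [22, 35, 61, 48, 29]

demand_z = z_scores(demand)
age_z = z_scores(age)
```

After this step both columns have mean 0 and standard deviation 1, so neither one outweighs the other in a Euclidean distance just because of its units.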

Discrete data is a larger issue. K-means is meant for continuous data. The mean of discrete values is generally not itself a valid discrete value, so the cluster centers will likely be points that no respondent could actually have given. There is a good chance that the clustering algorithm ends up discovering the discreteness of your data instead of a sensible structure.

Categorical variables are worse. K-means can't handle them at all; a popular hack is to turn each one into multiple binary variables (e.g. male, female). However, this exposes the problems above on an even worse scale, because you now have multiple highly correlated binary variables.
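To make the problem with the binary-variable hack concrete, here is a small sketch in plain Python. The `one_hot` helper and the sample data are invented for illustration.

```python
def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical gender column for five respondents.
gender = ["male", "female", "female", "male", "female"]
categories, encoded = one_hot(gender)

# The dummy columns are perfectly dependent: each row sums to 1.
row_sums = [sum(row) for row in encoded]

# The "mean" that k-means would use as a cluster center is a fraction,
# which does not correspond to any actual category.
centroid = [sum(col) / len(encoded) for col in zip(*encoded)]
```

Here the centroid is about 0.6 on the `female` column and 0.4 on the `male` column, which is not a value any respondent could have, and the two columns carry the same information twice.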

Since you are apparently dealing with survey data, consider using hierarchical clustering. With an appropriate distance function, it can deal with all of the above issues. You just need to spend some effort on finding a good measure of similarity.
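One possible way to do this is sketched below in plain Python. The choice of a Gower-style dissimilarity and of average linkage is an assumption made for the example, not something the answer prescribes; the helper names and the respondent data are hypothetical, and the naive merge loop is only meant for small data sets.

```python
def gower_distance(a, b, ranges):
    """Gower-style dissimilarity between two records.

    Numeric fields contribute |a - b| / range; categorical fields
    contribute 0 if equal and 1 otherwise. `ranges` holds the range of
    each numeric field and None for each categorical field.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r if r > 0 else 0.0
    return total / len(ranges)


def average_linkage(records, ranges, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(records))]

    def cluster_distance(c1, c2):
        dists = [gower_distance(records[i], records[j], ranges)
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: cluster_distance(clusters[pq[0]],
                                                          clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical respondents: (demand on a 1-7 scale, age in years, gender).
respondents = [
    (6, 24, "female"),
    (7, 27, "female"),
    (2, 58, "male"),
    (1, 63, "male"),
    (6, 30, "female"),
]
# Scale range for demand, observed range for age, None for the categorical field.
ranges = [7 - 1, 63 - 24, None]

clusters = average_linkage(respondents, ranges, n_clusters=2)
```

Because every field is scaled to the interval from 0 to 1 before averaging, no separate standardization step is needed, and gender is compared by equality rather than being treated as a number.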

Cluster 3.0: I have never even seen it. I figure it is an okay choice for people who are not data scientists. It is probably similar to other tools such as Matlab. It will likely be missing the modern algorithms, but it probably has an easy-to-use interface.
