Solved – Kmeans: Whether to standardise? Can you use categorical variables? Is Cluster 3.0 suitable

clusteringk-means

I am running kmeans for a market research study, and I have a couple of questions:

Should I be standardizing my data, and if so, how? For example, one variable I have is product demand, which is measured on a seven point scale. On the other hand, I also have a variable on age, which is a very different scale. Should I be standardizing these, and how?
Can I use categorical variables in kmeans? Specifically, I would like to use gender and ethnicity. If it is possible, how would I prepare this data for the cluster analysis? I suppose I would assign numbers to them, but how would I standardize these with my other data?
I downloaded the open source software Cluster 3.0. Is this a good one to use?

Best Answer

First of all: yes: standardization is a must unless you have a strong argument why it is not necessary. Probably try z scores first.

Discrete data is a larger issue. K-means is meant for continuous data. The mean will not be discrete, so the cluster centers will likely be anomalous. You have a high chance that the clustering algorithms ends up discovering the discreteness of your data, instead of a sensible structure.

Categorical variables are worse. K-means can't handle them at all; a popular hack is to turn them into multiple binary variables (male, female). This will however expose above problems just at an even worse scale, because now it's multiple highly correlated binary variables.

Since you apparently are dealing with survey data, consider using hierarchical clustering. With an appropriate distance function, it can deal with all above issues. You just need to spend some effort on finding a good measure of similarity.

Cluster 3.0 - I have never even seen it. I figure it is an okay choice for non data science people. Probably similar to other tools such as Matlab. It will be missing all the modern algorithms, but it probably has an easy to use user interface.

Related Solutions

Clustering Algorithms – How to Determine If Data Is Clustered Enough

About k-means specifically, you can use the Gap statistics. Basically, the idea is to compute a goodness of clustering measure based on average dispersion compared to a reference distribution for an increasing number of clusters. More information can be found in the original paper:

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the numbers of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63(2): 411-423.

The answer that I provided to a related question highlights other general validity indices that might be used to check whether a given dataset exhibits some kind of a structure.

When you don't have any idea of what you would expect to find if there was noise only, a good approach is to use resampling and study clusters stability. In other words, resample your data (via bootstrap or by adding small noise to it) and compute the "closeness" of the resulting partitions, as measured by Jaccard similarities. In short, it allows to estimate the frequency with which similar clusters were recovered in the data. This method is readily available in the fpc R package as clusterboot(). It takes as input either raw data or a distance matrix, and allows to apply a wide range of clustering methods (hierarchical, k-means, fuzzy methods). The method is discussed in the linked references:

Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.

Hennig, C. (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99, 1154-1176.

Below is a small demonstration with the k-means algorithm.

sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
rnorm(n, mean[2],sd[2]))
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
            sim.xy(100, c(2.5,0), c(.4,.2)),
            sim.xy(100, c(1.25,.5), c(.3,.2)))
library(fpc)
km.boot <- clusterboot(xy, B=20, bootmethod="boot",
                       clustermethod=kmeansCBI,
                       krange=3, seed=15555)

The results are quite positive in this artificial (and well structured) dataset since none of the three clusters (krange) were dissolved across the samples, and the average clusterwise Jaccard similarity is > 0.95 for all clusters.

Below are the results on the 20 bootstrap samples. As can be seen, statistical units tend to stay grouped into the same cluster, with few exceptions for those observations lying in between.

enter image description here

You can extend this idea to any validity index, of course: choose a new series of observations by bootstrap (with replacement), compute your statistic (e.g., silhouette width, cophenetic correlation, Hubert's gamma, within sum of squares) for a range of cluster numbers (e.g., 2 to 10), repeat 100 or 500 times, and look at the boxplot of your statistic as a function of the number of cluster.

Here is what I get with the same simulated dataset, but using Ward's hierarchical clustering and considering the cophenetic correlation (which assess how well distance information are reproduced in the resulting partitions) and silhouette width (a combination measure assessing intra-cluster homogeneity and inter-cluster separation).

The cophenetic correlation ranges from 0.6267 to 0.7511 with a median value of 0.7031 (500 bootstrap samples). Silhouette width appears to be maximal when we consider 3 clusters (median 0.8408, range 0.7371-0.8769).

enter image description here

Solved – Can we use cluster analysis in multiple regression

Since you have labeled data, a supervised approach will usually outdo any unsupervised approach.

I agree with Peter Flom, who in the comments noted that you "misunderstand the purpose of cluster analysis". Clusters are not meant to find regions with maximum prices.

Chances are that by partitioning your data into clusters without paying attention to price, your multiple regeression approach will be worse, because it only sees part of the data; and it may have discontinuities at the borders. But in other cases, exactly this can help.

Why not just give it a try and see for yourself? But beware of overfitting, don't increase the number of variables too much.

Best Answer

Related Solutions

Clustering Algorithms – How to Determine If Data Is Clustered Enough

Solved – Can we use cluster analysis in multiple regression

Related Question