Solved – Interpreting Silhouette plot for Cluster Analysis

Tags: clustering, mixed type data, r

I am running a cluster analysis on mixed-type data in R and I am trying to interpret the silhouette plot. For whatever reason, it is telling me that more clusters are ideal for the analysis. Why could this be? I am using a sample of 10k observations with 6 variables (4 of which are categorical).

[Silhouette plot: average silhouette width by number of clusters]

Best Answer

Revised

This answer has been completely revised, largely in reaction to a useful comment by @Anony-Mousse in his answer. He says, "categorical data frequently does not contain clusters". I do not want to put words in his mouth, but I understand this to mean "does not contain meaningful clusters". This is to amplify that comment in the context of the question.

What I think you are doing is using Gower distance on your data and then applying some clustering algorithm. Finding the number of clusters that maximizes the average silhouette is consistent with the advice given on the Wikipedia page Determining the number of clusters in a data set.

Let me go through an example using just four binary categorical variables, ignoring your continuous ones. I generate some data, cluster it with PAM on the Gower distance for various numbers of clusters, compute the silhouette, and plot the results, obtaining a graph not dissimilar to yours. Spoiler alert! The process produces misleading results.

library(cluster)

set.seed(2018)              # Happy New Year!
# Four binary categorical variables, 1000 observations, uniformly at random
c1 = factor(sample(2, 1000, replace=TRUE))
c2 = factor(sample(2, 1000, replace=TRUE))
c3 = factor(sample(2, 1000, replace=TRUE))
c4 = factor(sample(2, 1000, replace=TRUE))
Cat4 = data.frame(c1,c2,c3,c4)

DM = daisy(Cat4)    # Gower distance is daisy's default for factor data
SIL = sapply(2:20, function(i) {
    mean(silhouette(pam(DM, i), DM)[,3]) })   # average silhouette width for k = 2..20
plot(c(0,SIL), type="b",   # leading 0 pads k = 1 so the x-axis reads as k
     xlab="Number of clusters k", ylab="Average silhouette width")

[Plot of average silhouette values, peaking at 16 clusters]

I set the random seed to get a fully reproducible result, but I suggest running this a few times without setting the seed (one way to automate that is sketched below) to see that you (almost) always get a graph that peaks at 16 clusters. That must be the right number of clusters, right? No way! Notice that I generated the data at random; this is what uniform random data looks like. So why does the "silhouette method" give a clear answer of 16 clusters?
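Here is a sketch of that check (peaks is just an illustrative name; the loop simply repeats the recipe above on fresh random data):

peaks = replicate(5, {
    X = Cat4
    X[] = lapply(X, function(v) factor(sample(2, 1000, replace = TRUE)))
    D = daisy(X)                # fresh random data, same Gower recipe
    sil = sapply(2:20, function(i) mean(silhouette(pam(D, i), D)[,3]))
    which.max(sil) + 1          # +1 because sil[1] corresponds to k = 2
})
peaks                           # (almost) always 16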

Let's look at the distance matrix.

as.matrix(DM)[1:6, 1:6]
     1    2    3    4    5    6
1 0.00 0.25 0.75 0.75 0.75 0.25
2 0.25 0.00 0.50 0.50 0.50 0.50
3 0.75 0.50 0.00 0.00 0.00 0.50
4 0.75 0.50 0.00 0.00 0.00 0.50
5 0.75 0.50 0.00 0.00 0.00 0.50
6 0.25 0.50 0.50 0.50 0.50 0.00

table(DM)
DM
     0   0.25    0.5   0.75      1 
 31044 124449 188147 124462  31398 

Given two points, they can disagree in 0, 1, 2, 3, or 4 coordinates, and Gower distance normalizes this to 0, ¼, ½, ¾, or 1. These are the only possible distance values. With 4 binary categorical variables there are 2^4 = 16 possible value combinations. If all points sharing the same combination of four values are placed in the same cluster, then every within-cluster distance is zero and every between-cluster distance is at least 0.25, which makes the silhouette a "perfect" 1 with 16 clusters. But again, this example is just random data, and these clusters are not meaningful. For every point, the points at distance 0.25 land in another cluster, even though no smaller distance between unequal points is possible. The discretization of distances encourages treating every distinct value combination as its own cluster.
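You can verify both counts directly: this sketch enumerates all 2^4 = 16 possible profiles and confirms that the Gower distance between distinct binary profiles takes only the values above.

combos = expand.grid(c1 = factor(1:2), c2 = factor(1:2),
                     c3 = factor(1:2), c4 = factor(1:2))
nrow(combos)                            # 16 distinct profiles
sort(unique(as.numeric(daisy(combos)))) # 0.25 0.50 0.75 1.00 (0 occurs only between identical rows)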

I think this is what you are seeing in your graph. Of course, I am not looking at your categorical variables, and I am ignoring any effect of the continuous ones; I don't even know whether your categorical variables are binary or can take multiple values. But here is something worth trying. Compute the number of possible combinations of your categorical variables. If your variables are binary, you should get 16. If they are not binary, use

MaxComb = length(levels(c1)) * length(levels(c2)) *
        length(levels(c3)) * length(levels(c4))
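Equivalently, for any data frame of factors (here the Cat4 frame from the example above):

MaxComb = prod(sapply(Cat4, nlevels))   # product of the number of levels per factor
MaxComb                                 # 16 for four binary variables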

Use whatever clustering method you have been using with MaxComb clusters. Then for each categorical variable, make a table of the cluster number vs the value of the categorical variable. Here is what happens with one variable in my example.

P16 = pam(DM, 16)          # one cluster for each of the 16 possible combinations
table(P16$clustering, c1)  # cluster membership vs. the first categorical variable
    c1
      1  2
  1  53  0
  2  69  0
  3  71  0
  4  59  0
  5   0 64
  6   0 62
  7   0 60
  8  55  0
  9   0 52
  10  0 59
  11  0 68
  12  0 65
  13 58  0
  14 68  0
  15 65  0
  16  0 72

Notice that within each cluster, the categorical variable takes on only one value. The same holds for all four categorical variables: the clusters are completely determined by them. Does that happen with your data, even with the continuous variables included? If so, the discretized categorical variables are dominating the clustering process, and the separation may not mean much. Once the continuous variables are included the distances are no longer strictly discrete, but they may still fall into groups driven only by the categorical variables.
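As a sketch of that check on the simulated data, this one-liner verifies that every cluster is constant in each of the four categorical variables:

sapply(Cat4, function(v)
    all(rowSums(table(P16$clustering, v) > 0) == 1))
#   c1   c2   c3   c4
# TRUE TRUE TRUE TRUE

And to illustrate the last point, here is a sketch that appends two made-up uniform continuous columns (x1 and x2 are hypothetical names, not from the question) and plots the resulting Gower distances. The distribution is no longer five spikes, but it still clumps into humps spaced 1/6 apart, one per categorical mismatch count:

Mix6 = cbind(Cat4, x1 = runif(1000), x2 = runif(1000))
DM6 = daisy(Mix6)                # Gower on 6 mixed-type columns
hist(as.numeric(DM6), breaks = 60,
     main = "Gower distances: 4 binary + 2 continuous variables")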

Some people seem to get clustering results they are satisfied with using Gower distance; see, for example, K-Means clustering for mixed numeric and categorical data. But I think this discretization of distances means that interpreting the results requires a lot of caution.