Determining number of clusters with SSE scree plot with Gower’s coefficient of similarity

clustering, gower-similarity

I am researching cluster analysis, and I am interested in data sets that mix categorical and continuous variables, for which I have read that Gower's similarity coefficient is a good proximity measure. I would like to start with an average linkage algorithm, and I have found that some people recommend looking for the 'elbow' in the sum of squared error (SSE) scree plot as a guideline for deciding how many clusters to retain. Would Gower's similarity coefficient (being non-metric and non-Euclidean) allow me to create an SSE scree plot, or would that not make sense statistically?

Best Answer

SSE, the sum of squared Euclidean distances from each point to its cluster mean, is the objective that k-means minimizes.

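It doesn't make much sense for any algorithm other than k-means, because it relies on cluster means and squared Euclidean distances, and neither is well defined for mixed data compared with Gower's coefficient. And even for k-means it suffers from the fact that increasing k will always decrease SSE, so all you can do is look for the point at which further increasing k stops yielding a substantial decrease in SSE. That is essentially the vague "elbow method".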
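To make the elbow idea concrete, here is a minimal sketch in pure Python. It is not taken from any library: the tiny k-means implementation, the deterministic farthest-first seeding, and the toy data (three tight groups of 2-D points) are all assumptions chosen only for illustration.

```python
# Illustrative only: a bare-bones k-means (Lloyd's algorithm) used to show
# how SSE behaves as k grows. The data and all names are hypothetical.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers

def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the SSE of the final partition."""
    centers = farthest_first_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE falls steeply until k = 3 and changes only slightly afterwards, which is the "elbow". Note that the SSE keeps shrinking beyond k = 3 as well, so the plot never tells you to stop; it only shows where the gains become small.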

Other criteria exist, such as the silhouette coefficient, the Davies-Bouldin index, BIC, and AIC, that can be used to get an "alternative view" of what is actually optimal.
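Of these, the silhouette is the one that fits the setup in the question most directly, since it is computed from pairwise dissimilarities alone and needs no cluster means. The following is a rough sketch under stated assumptions, not a reference implementation: the Gower dissimilarity uses range-normalized absolute differences for numeric columns and simple matching for categorical ones, the average linkage is a naive version written for readability, and the rows (age, income, colour) are invented toy data.

```python
# Illustrative only: Gower dissimilarity + naive average linkage +
# mean silhouette, all in pure Python. Data and names are hypothetical.

def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarities for mixed-type rows.
    Numeric columns contribute |x - y| / range; all other columns
    contribute 0 if the values match and 1 otherwise."""
    n, n_cols = len(rows), len(rows[0])
    ranges = {}
    for c in numeric_cols:
        vals = [r[c] for r in rows]
        ranges[c] = (max(vals) - min(vals)) or 1.0  # guard against zero range
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for c in range(n_cols):
                if c in ranges:
                    total += abs(rows[i][c] - rows[j][c]) / ranges[c]
                else:
                    total += 0.0 if rows[i][c] == rows[j][c] else 1.0
            dist[i][j] = dist[j][i] = total / n_cols
    return dist

def average_linkage(dist, k):
    """Merge the two clusters with the smallest mean pairwise
    dissimilarity until k clusters remain; return one label per row."""
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, a, b = min(
            (avg(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def mean_silhouette(dist, labels):
    """Mean silhouette width from a precomputed dissimilarity matrix."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton's silhouette is taken to be 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy mixed-type data: (age, income in thousands, favourite colour).
rows = [
    (22, 30.0, "red"), (25, 32.0, "red"), (24, 29.0, "red"),
    (58, 80.0, "blue"), (61, 85.0, "blue"), (60, 78.0, "blue"),
]

dist = gower_matrix(rows, numeric_cols=[0, 1])
silhouette_by_k = {
    k: mean_silhouette(dist, average_linkage(dist, k))
    for k in range(2, len(rows))
}
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, best_k)
```

Unlike SSE, the mean silhouette does not improve automatically as k grows, so it can be maximized over k rather than inspected for an elbow. On real data the maximum is often far less clear-cut than on this toy example.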

But in the end, each of these is just a mathematical heuristic. It may not work for real data.