Solved – Dissimilarity Matrix – Number of clusters

clustering, r

I am trying to figure out whether a method such as the elbow method, the average silhouette, or the gap statistic can be applied to a dissimilarity matrix. My matrix is 100 x 100 (one row and column per object) and satisfies the triangle inequality, so the dissimilarity is a metric, but the distances are not Euclidean. Can I use one of these methods, or is there another way to determine the number of clusters from my matrix? I don't have any other data available.

I'm using R for the clustering.

Thanks a lot!

Best Answer

I interpret your question to mean that you have the dissimilarity matrix, but do not have the actual points that were used to generate the matrix. Can one use only the dissimilarity matrix (not the points) to get the number of clusters?

When you say elbow method, I understand that to mean that you will compute SSE = sum of squared distances from points within each cluster to the cluster center. Since the cluster center is in general not one of the points (and therefore not in your matrix), you cannot compute this without access to the points.

Similarly, the gap statistic uses the within-cluster SSE and so cannot be computed without access to the original data.
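For contrast, here is a minimal sketch of the elbow computation as described above, assuming the original coordinates are available (the ruspini data from the cluster package stands in for them). The point is that kmeans() takes the data itself and measures distances to fitted centers, which is the information a bare dissimilarity matrix lacks. The seed, the nstart value, and the range of k are arbitrary choices for illustration.

```r
library(cluster)  # only for the ruspini example data

set.seed(1)       # k-means starts from random centers
ks <- 1:10

# Total within-cluster sum of squares (SSE) for each k.
# kmeans() needs the coordinates; it does not accept a distance matrix.
wss <- sapply(ks, function(k) {
  kmeans(ruspini, centers = k, nstart = 10)$tot.withinss
})

plot(ks, wss, type = "b", pch = 20,
     xlab = "Number of clusters", ylab = "Within-cluster SSE")
```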

However, the silhouette uses only distances between points in the original data, not cluster centers, so all the information that you need is in your distance matrix. Here is an example that computes silhouettes from the distance matrix alone. I start by running hclust on the distance matrix to get a hierarchical clustering:

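library(cluster)  # provides silhouette() and the ruspini data set
# Full dissimilarity matrix, standing in for the one you already have
DM <- as.matrix(dist(ruspini))
# hclust() expects a "dist" object, so convert the matrix back
HC <- hclust(as.dist(DM), method = "single")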

This looks a little silly. I have converted a distance object to a full dissimilarity matrix and then converted it back to a distance object. I did this because your question asks about using a dissimilarity matrix and I wanted to start from that point. Now let's compute the average silhouette using various numbers of clusters.

## Silhouette
plot(2:10, sapply(2:10, function(i) { 
   mean(silhouette(cutree(HC, i), dmatrix=DM)[,"sil_width"]) }),
   xlab="Number of clusters", ylab="Average Silhouette", type="b", pch=20)

[Figure: average silhouette versus number of clusters]

This suggests that there should be four clusters, the number with the highest average silhouette.
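To read the best value off programmatically rather than from the plot, one can keep the average widths in a vector and take the maximum. This is a small sketch that repeats the setup above so it runs on its own; the names ks, avg_sil, and best_k are introduced here for illustration only.

```r
library(cluster)

# Same setup as above: a full dissimilarity matrix and a single-linkage tree.
DM <- as.matrix(dist(ruspini))
HC <- hclust(as.dist(DM), method = "single")

ks <- 2:10

# Average silhouette width for each candidate number of clusters,
# computed from the dissimilarity matrix alone.
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(HC, k), dmatrix = DM)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

# The candidate k with the largest average silhouette width.
best_k <- ks[which.max(avg_sil)]

print(round(avg_sil, 3))
print(best_k)
```

With your own data, DM would simply be your 100 x 100 matrix, assuming it is symmetric with a zero diagonal so that as.dist() represents it faithfully.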
