Machine Learning – Comparing Silhouette Score vs Distortion Score in Clustering

clusteringdata miningk-meansmachine learningunsupervised learning

I am working on segmenting my customers with clustering. My dataset size is 7315 rows and 30 features.

So, as a beginner to clustering, I passed all my 29 features (excluding id column) to the cluster.

So, I did the below two approach to find best K

Approach 1 – Ran a for loop and computed the silhouette score to identify optimal K

Approach 2 – Found out distortion score using elbow visualizer

My question is as follows,

a) when elbow visualizer shows that k=6 is the optimal number of clusters, why is my silhouette score less? You can see that silhouette score is high for cluster 2 but for cluster 6, it is low. Is this expected?

I was expecting to see high silhouette score for n_cluster = 6.

b) what is the difference between distortion score and silhouoette score?

You can find details about the code below

Please find below

enter image description here

enter image description here

Best Answer

As you can read here , the distortion score is "...the average of the squared distances from the cluster centers of the respective clusters."

Distortion and silhouette are simply different measures. Distortion is taking into account ONLY the tightness of the cluster (so, distortion goes down when "average of the squared distances [between each point in a cluster and the cluster center]"). The closer all the points in a cluster are to the center of that same cluster, the lower the distortion.

However, silhouette ALSO takes into account the distances between points of one cluster and NEAREST cluster center. Meaning that in order to have a good silhouette score, clusters generally need to be tighter and farther from each other.

An extreme example to illustrate the difference: If you have two clusters that are super tight, but the clusters are super close/right on top of each other, this would have a good distortion score (because the clusters are tight, nothing else is considered) but a bad silhouette score (because the clusters are right on top of each other, so the clusters themselves are hard to tell apart from one another).

I hope this makes sense!

I would recommend visualizing your data to see which metric you think is more appropriate for your situation.

Related Question