Solved – Compare clustering results based on intra cluster similarity

Tags: clustering, k-medoids, k-means

I am working on a university project, part of which is to compare the influence of PCA on clustering. I have a football player dataset with a feature called "position group" that takes the values 1 to 3. For example, the heavy linemen are in group 1, the lighter receivers, cornerbacks, etc. are in group 2, and so on. I need to generate clusters with k-means and k-medoids based on 16 features: fitness exercise results and body composition measurements, such as size and weight, for each player.
I use k = 3 because there are 3 player groups in the dataset. The goal of the clustering is to determine an "optimal theoretical" allocation of players to groups, so that I can say something like: "3 wide receivers moved to the group of the heavy linemen based on the clustering results. This could indicate a wrong position allocation by the coach, who should check it."
I run each algorithm on the same dataset with and without PCA applied, which gives 4 results in total.

Now I want to compare the results. I compare the clusters with the original position groups using the Rand index.
The methods do not differ much:

Algorithm               Rand index vs. original groups
K-means without PCA     0.514
K-means with PCA        0.544
K-medoids without PCA   0.528
K-medoids with PCA      0.532
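
For reference, the Rand index used in this comparison can be sketched in a few lines of plain Python. This is only an illustration: the function name `rand_index` and the toy labels are made up here and are not taken from the dataset. It counts, over all pairs of players, how often the two partitions agree on "same group" versus "different group".

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair counts as agreement if both partitions put it together
        # or both partitions keep it apart.
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Hypothetical toy labels: the same partition under different cluster names.
position_group = [1, 1, 2, 2, 3, 3]
cluster_id = [2, 2, 0, 0, 1, 1]
print(rand_index(position_group, cluster_id))  # 1.0
```

The plain Rand index is not corrected for chance. For two independent, uniformly random labelings into three groups, a pair agrees with probability 1/9 + 4/9 = 5/9, about 0.56, so values around 0.5 indicate little agreement with the position groups. A chance-corrected variant such as the adjusted Rand index may be easier to interpret.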

Furthermore, I use intra- and inter-cluster similarity measures. The intra-cluster distances are as follows:

Algorithm               Cluster 1   Cluster 2   Cluster 3
K-means without PCA     2.452       2.341       2.675
K-means with PCA        2.324       2.216       1.560
K-medoids without PCA   2.166       2.828       2.320
K-medoids with PCA      1.968       2.642       2.420
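
The post does not say how the intra-cluster distance was computed. As a minimal sketch, assuming it means the mean Euclidean distance of each member to its cluster centroid, per-cluster values and a size-weighted summary could be obtained as follows. All function names and the toy points are hypothetical.

```python
from math import dist
from statistics import fmean

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(fmean(axis) for axis in zip(*points))

def intra_cluster_distances(points, labels):
    """Map each cluster label to (size, mean distance to its centroid)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    result = {}
    for label, members in clusters.items():
        center = centroid(members)
        mean_distance = fmean([dist(p, center) for p in members])
        result[label] = (len(members), mean_distance)
    return result

def weighted_mean_distance(per_cluster):
    """Size-weighted average, so a tiny cluster does not count as much as a large one."""
    total = sum(size for size, _ in per_cluster.values())
    return sum(size * d for size, d in per_cluster.values()) / total

# Hypothetical toy data: two clusters in two dimensions.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0), (10.0, 8.0)]
labels = [0, 0, 1, 1, 1]
per_cluster = intra_cluster_distances(points, labels)
print(per_cluster)
print(weighted_mean_distance(per_cluster))
```

Weighting by cluster size is one design choice; an unweighted sum treats a cluster of 3 players the same as a cluster of 30. Note also that distances measured after PCA with fewer components (and no rescaling or whitening) cannot be larger than in the original feature space, so a smaller value after PCA is not by itself evidence of better clusters.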

What is the best way to determine the best result, especially for the intra-cluster distances? Should I sum the distances for each method and treat the smallest sum as the best approach?

Best Answer

Do not expect the clusters to agree with your known classes.

They may agree, but nothing guarantees that they will.

A clustering would be perfectly valid if it grouped your football players into three clusters that correspond to, e.g.:

  • blonde hair
  • brunette hair
  • black hair

This is perfectly reasonable for an unsupervised method.

If you have a predefined task, such as assigning players to the three known groups, then you should use a classifier instead.
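
To make the classifier suggestion concrete, here is a minimal sketch using a nearest-centroid classifier with leave-one-out prediction. This technique is chosen here only because it fits in a few lines of plain Python; it is not a recommendation of a specific model. The function names, the two features, and the toy values are all hypothetical. Each player is predicted from all the other labelled players, and players whose predicted group differs from the assigned one are flagged for review.

```python
from math import dist
from statistics import fmean

def fit_centroids(features, groups):
    """Return {group: mean feature vector} for the labelled players."""
    by_group = {}
    for row, group in zip(features, groups):
        by_group.setdefault(group, []).append(row)
    return {g: tuple(fmean(col) for col in zip(*rows))
            for g, rows in by_group.items()}

def predict(centroids, row):
    """Assign the group whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda g: dist(row, centroids[g]))

def flag_mismatches(features, groups):
    """Leave-one-out: predict each player from the others, report disagreements."""
    flagged = []
    for i, (row, group) in enumerate(zip(features, groups)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = groups[:i] + groups[i + 1:]
        predicted = predict(fit_centroids(rest_x, rest_y), row)
        if predicted != group:
            flagged.append((i, group, predicted))
    return flagged

# Hypothetical, already standardised features: (weight, sprint speed).
features = [(2.0, -1.0), (1.8, -1.2), (2.1, -0.9),
            (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
            (-1.5, 1.2), (-1.7, 1.0), (-1.6, 1.1), (1.9, -1.1)]
# The last player is listed in group 3 but resembles group 1.
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(flag_mismatches(features, groups))  # [(9, 3, 1)]
```

Each tuple in the output is (player index, assigned group, predicted group), which maps directly onto statements like "this player is listed in group 3 but looks like group 1".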
