Clustering – Understanding Comparisons of Clustering Results in R

clustering, r

I'm experimenting with classifying data into groups. I'm quite new to this topic, and trying to understand the output of some of the analysis.

Using examples from Quick-R, several R packages are suggested. I have tried two of these packages (fpc, together with the base kmeans function, and mclust). One aspect of this analysis that I do not understand is the comparison of the results.

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
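
For reference, here is a minimal, self-contained sketch of what I am running, using the built-in iris data as a stand-in for my dataset (my actual data and number of clusters differ):

    library(fpc)

    dat <- scale(iris[, 1:4])          # scale the variables first
    d   <- dist(dat)                   # Euclidean distance matrix

    set.seed(1)
    fit1 <- kmeans(dat, centers = 4)   # first clustering solution
    set.seed(2)
    fit2 <- kmeans(dat, centers = 4)   # second clustering solution

    # compare the two partitions
    out <- cluster.stats(d, fit1$cluster, fit2$cluster)
    out$corrected.rand                 # agreement between the two solutions
    out$vi                             # variation of information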

I've read through the relevant parts of the fpc manual and am still not clear on what I should be aiming for. For example, this is the output of comparing two different clustering approaches:

$n
[1] 521

$cluster.number
[1] 4

$cluster.size
[1] 250 119  78  74

$diameter
[1]  5.278162  9.773658 16.460074  7.328020

$average.distance
[1] 1.632656 2.106422 3.461598 2.622574

$median.distance
[1] 1.562625 1.788113 2.763217 2.463826

$separation
[1] 0.2797048 0.3754188 0.2797048 0.3557264

$average.toother
[1] 3.442575 3.929158 4.068230 4.425910

$separation.matrix
          [,1]      [,2]      [,3]      [,4]
[1,] 0.0000000 0.3754188 0.2797048 0.3557264
[2,] 0.3754188 0.0000000 0.6299734 2.9020383
[3,] 0.2797048 0.6299734 0.0000000 0.6803704
[4,] 0.3557264 2.9020383 0.6803704 0.0000000

$average.between
[1] 3.865142

$average.within
[1] 1.894740

$n.between
[1] 91610

$n.within
[1] 43850

$within.cluster.ss
[1] 1785.935

$clus.avg.silwidths
         1          2          3          4 
0.42072895 0.31672350 0.01810699 0.23728253 

$avg.silwidth
[1] 0.3106403

$g2
NULL

$g3
NULL

$pearsongamma
[1] 0.4869491

$dunn
[1] 0.01699292

$entropy
[1] 1.251134

$wb.ratio
[1] 0.4902123

$ch
[1] 178.9074

$corrected.rand
[1] 0.2046704

$vi
[1] 1.56189

My primary question here is to better understand how to interpret the results of this cluster comparison.


Previously, I had asked more about the effect of scaling data and calculating a distance matrix. However, that was answered clearly by mariana soffer, and I'm reorganizing my question to emphasize that I am interested in the interpretation of my output, which is a comparison of two different clustering algorithms.

Previous part of question:
If I am doing any type of clustering, should I always scale the data? For example, I am using dist() on my scaled dataset as input to the cluster.stats() function, but I don't fully understand what is going on. I read about dist() here, and it states that:

this function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.

Best Answer

First, let me say that I am not going to explain all of the measures exactly; instead, I'll give you an idea of how to compare how good the clustering methods are (let's assume we are comparing 2 clustering methods with the same number of clusters).

  1. For example, the bigger the diameter of a cluster, the worse the clustering, because the points that belong to the cluster are more scattered.
  2. The higher the average distance within each cluster, the worse the clustering method. (Let's assume that the average distance is the average of the distances from each point in the cluster to the center of the cluster.)

These are the two most commonly used metrics. Check these links to understand what they stand for:

  • inter-cluster distance (the higher the better): the sum of the distances between the different cluster centroids
  • intra-cluster distance (the lower the better): the sum of the distances from the cluster members to the center of their cluster

To better understand the metrics above, check this.
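
For instance, here is a rough sketch of how these two quantities could be computed by hand for a kmeans fit (assuming the built-in iris data; adapt the object names to your own). Note that this follows the center-based definition above, which may differ from how fpc computes $average.distance internally:

    dat <- scale(iris[, 1:4])
    set.seed(1)
    fit <- kmeans(dat, centers = 4)

    # intra-cluster distance: each point to the center of its own cluster
    intra <- sqrt(rowSums((dat - fit$centers[fit$cluster, ])^2))
    tapply(intra, fit$cluster, mean)   # average per cluster (lower is better)

    # inter-cluster distance: pairwise distances between the cluster centroids
    dist(fit$centers)                  # higher is better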

Then you should read the manual of the library and functions you are using to understand which measures correspond to each of these, or, if these are not included, to find the meaning of the ones that are. However, I would not bother, and would stick with the ones stated here.

Let's go on to the questions you asked:

  1. Regarding scaling data: yes, you should always scale the data for clustering. Otherwise, the different scales of the different dimensions (variables) will have different influences on how the data are clustered: the higher the values in a variable, the more influential that variable will be in how the clustering is done, while they should all have the same influence (unless, for some particular reason, you do not want it that way). A small sketch after this list illustrates the effect.
  2. The distance functions compute all the distances from one point (instance) to another. The most common distance measure is the Euclidean one. For example, suppose you want to measure the distance from instance 1 to instance 2 (let's assume you only have 2 instances for the sake of simplicity), and that each instance has 3 values (x1, x2, x3): I1 = (0.3, 0.2, 0.5) and I2 = (0.3, 0.3, 0.4). The Euclidean distance between I1 and I2 is then sqrt((0.3-0.3)^2 + (0.2-0.3)^2 + (0.5-0.4)^2) = sqrt(0.02) ≈ 0.14, hence the distance matrix is:

        i1    i2
    i1  0     0.14
    i2  0.14  0
    

Notice that the distance matrix is always symmetrical.
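
You can verify this small example directly in R with dist() (base R):

    m <- rbind(I1 = c(0.3, 0.2, 0.5),
               I2 = c(0.3, 0.3, 0.4))
    dist(m)                       # Euclidean by default: sqrt(0.02) ≈ 0.1414

And here is a toy illustration of point 1 above, why scaling matters (made-up numbers):

    x <- c(1, 2, 3)               # small-scale variable
    y <- c(100, 4000, 250)        # large-scale variable
    dist(cbind(x, y))             # distances are dominated by y
    dist(scale(cbind(x, y)))      # after scaling, both variables contribute comparably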

The Euclidean distance formula is not the only one that exists; many other distances can be used to calculate this matrix. Check, for example, the Wikipedia article on the Manhattan distance and how to calculate it. At the end of the Wikipedia page for the Euclidean distance (where you can also check its formula) you can see which other distances exist.
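
For example, dist() lets you switch the metric via its method argument (reusing the two instances from above):

    m <- rbind(I1 = c(0.3, 0.2, 0.5),
               I2 = c(0.3, 0.3, 0.4))
    dist(m, method = "manhattan")  # |0.3-0.3| + |0.2-0.3| + |0.5-0.4| = 0.2
    dist(m, method = "maximum")    # Chebyshev: largest coordinate difference = 0.1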
