I'm experimenting with classifying data into groups. I'm quite new to this topic, and trying to understand the output of some of the analysis.
Using examples from Quick-R, several R packages are suggested. I have tried using two of these packages (fpc, using the kmeans function, and mclust). One aspect of this analysis that I do not understand is the comparison of the results.
# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)
I've read through the relevant parts of the fpc manual and am still not clear on what I should be aiming for. For example, this is the output of comparing two different clustering approaches:
$n
[1] 521
$cluster.number
[1] 4
$cluster.size
[1] 250 119 78 74
$diameter
[1] 5.278162 9.773658 16.460074 7.328020
$average.distance
[1] 1.632656 2.106422 3.461598 2.622574
$median.distance
[1] 1.562625 1.788113 2.763217 2.463826
$separation
[1] 0.2797048 0.3754188 0.2797048 0.3557264
$average.toother
[1] 3.442575 3.929158 4.068230 4.425910
$separation.matrix
[,1] [,2] [,3] [,4]
[1,] 0.0000000 0.3754188 0.2797048 0.3557264
[2,] 0.3754188 0.0000000 0.6299734 2.9020383
[3,] 0.2797048 0.6299734 0.0000000 0.6803704
[4,] 0.3557264 2.9020383 0.6803704 0.0000000
$average.between
[1] 3.865142
$average.within
[1] 1.894740
$n.between
[1] 91610
$n.within
[1] 43850
$within.cluster.ss
[1] 1785.935
$clus.avg.silwidths
1 2 3 4
0.42072895 0.31672350 0.01810699 0.23728253
$avg.silwidth
[1] 0.3106403
$g2
NULL
$g3
NULL
$pearsongamma
[1] 0.4869491
$dunn
[1] 0.01699292
$entropy
[1] 1.251134
$wb.ratio
[1] 0.4902123
$ch
[1] 178.9074
$corrected.rand
[1] 0.2046704
$vi
[1] 1.56189
My primary question here is to better understand how to interpret the results of this cluster comparison.
Previously, I had asked more about the effect of scaling data and calculating a distance matrix. However, that was answered clearly by mariana soffer, and I'm just reorganizing my question to emphasize that I am interested in the interpretation of my output, which is a comparison of two different clustering algorithms.
Previous part of question:
If I am doing any type of clustering, should I always scale data? For example, I am using the function dist() on my scaled dataset as input to the cluster.stats() function; however, I don't fully understand what is going on. I read about dist() here and it states that:
this function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
Best Answer
First, let me tell you that I am not going to explain every measure here, but I will give you an idea of how to compare how good the clustering methods are (let's assume we are comparing 2 clustering methods with the same number of clusters).
These are the two metrics that are the most used. Check these links to understand what they stand for:
To better understand the metrics above, check this.
Then you should read the manual of the library and functions you are using to find out which of the reported measures correspond to each of these. If they are not included, try to work out the meaning of the measures that are. However, I would not bother, and would stick with the ones I stated here.
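As a rough sketch of this kind of comparison, the snippet below computes the ratio of the average within-cluster distance to the average between-cluster distance for two labelings of the same data. This is the same idea as the average.within, average.between and wb.ratio entries in the output in the question (there, 1.894740 / 3.865142 gives the reported 0.4902123). The toy data, the helper name wb_ratio and the deliberately poor labeling labels_bad are assumptions made up for illustration, and the helper is a hand-rolled approximation, not the fpc implementation.

```r
# Toy data: two well-separated groups (made-up values, for illustration only)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
xs <- scale(x)   # scaled copy of the data
d <- dist(xs)    # Euclidean distance matrix between rows

# Hypothetical helper: average within-cluster distance divided by
# average between-cluster distance (lower means tighter, better-separated clusters)
wb_ratio <- function(d, labels) {
  m <- as.matrix(d)
  same <- outer(labels, labels, "==")  # TRUE where both points share a cluster
  pairs <- upper.tri(m)                # count each pair of points once
  mean(m[pairs & same]) / mean(m[pairs & !same])
}

# Labeling 1: k-means with 2 clusters
fit_good <- kmeans(xs, centers = 2, nstart = 10)

# Labeling 2: arbitrary alternating labels, standing in for a poor clustering
labels_bad <- rep(1:2, length.out = nrow(xs))

ratio_good <- wb_ratio(d, fit_good$cluster)
ratio_bad <- wb_ratio(d, labels_bad)
print(c(good = ratio_good, bad = ratio_bad))
```

With the same number of clusters in both labelings, the one with the smaller ratio has points that are, on average, closer to their own cluster than to the other clusters.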
Let's move on to the questions you asked:
The distance function computes the distance from every point (instance) to every other point. The most common distance measure is Euclidean. For example, suppose you want to measure the distance from instance 1 to instance 2 (let's assume you only have 2 instances for the sake of simplicity), and that each instance has 3 values (x1, x2, x3), so I1 = (0.3, 0.2, 0.5) and I2 = (0.3, 0.3, 0.4). The Euclidean distance between I1 and I2 would be sqrt((0.3-0.3)^2 + (0.2-0.3)^2 + (0.5-0.4)^2) = sqrt(0.02) ≈ 0.14, hence the distance matrix will result in:
     I1   I2
I1 0.00 0.14
I2 0.14 0.00
Notice that the distance matrix is always symmetrical.
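The calculation for the two instances above can be reproduced in R as a quick check. This is only a minimal sketch: the variable names instances and by_hand are made up for illustration, and dist() is the base R function mentioned in the question, which uses Euclidean distance by default.

```r
# The two instances from the example, one per row
instances <- rbind(I1 = c(0.3, 0.2, 0.5),
                   I2 = c(0.3, 0.3, 0.4))

# dist() computes distances between rows; Euclidean is the default method
d <- dist(instances)

# The same value computed by hand from the Euclidean formula
by_hand <- sqrt(sum((instances["I1", ] - instances["I2", ])^2))

# Expand to the full symmetric matrix, with zeros on the diagonal
dist_matrix <- as.matrix(d)
print(dist_matrix)
```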
The Euclidean distance formula is not the only one that exists. There are many other distances that can be used to calculate this matrix. Check, for example, the Wikipedia article on Manhattan distance and how to calculate it. At the end of the Wikipedia page for Euclidean distance (where you can also check its formula) you can see which other distances exist.
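In R, switching to a different distance is a matter of changing the method argument of dist(). The sketch below reuses the two example instances and compares a few of the methods that base R's dist() supports. The variable names are made up for illustration.

```r
# The two example instances, one per row
instances <- rbind(I1 = c(0.3, 0.2, 0.5),
                   I2 = c(0.3, 0.3, 0.4))

# Euclidean: square root of the sum of squared differences
d_euclidean <- dist(instances, method = "euclidean")

# Manhattan: sum of absolute differences
d_manhattan <- dist(instances, method = "manhattan")

# Maximum (Chebyshev): largest absolute difference in any coordinate
d_maximum <- dist(instances, method = "maximum")

print(c(euclidean = as.numeric(d_euclidean),
        manhattan = as.numeric(d_manhattan),
        maximum = as.numeric(d_maximum)))
```

Whichever distance you choose, the resulting matrix can be passed to cluster.stats() in the same way, but the values of the reported measures will change with it.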