Solved – Computation of C-index for cluster validation

clustering

How to calculate the C-index (an internal cluster validity index)? Please explain it with a small example. (I need the background calculation, i.e., how the pair of points in the cluster, minimum sum and maximum sum are used in the calculation).

Here is what I tried myself.
Let data x={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}; using kmeans, k=3 i got {1,2,3,4},{5,6,7,8,9},{10,11,12,13,14,15} are as clusters.

When I find the distance of x

   1 2 3 4 5 6 7 8 9 10 11 12 13 14
2  0                              
3  0 0                             
4  0 0 0                           
5  1 1 1 1                         
6  1 1 1 1 0                       
7  1 1 1 1 0 0                     
8  1 1 1 1 0 0 0                   
9  1 1 1 1 0 0 0 0                 
10 1 1 1 1 2 2 2 2 2               
11 1 1 1 1 2 2 2 2 2  0            
12 1 1 1 1 2 2 2 2 2  0  0         
13 1 1 1 1 2 2 2 2 2  0  0  0      
14 1 1 1 1 2 2 2 2 2  0  0  0  0   
15 1 1 1 1 2 2 2 2 2  0  0  0  0  0
15 1 1 1 1 2 2 2 2 2  0  0  0  0  0 

 and distance of clusters is  
1 2 3 4 5 6 7 8 9 10 11 12 13 14  
2  0                               
3  0 0                             
4  0 0 0                           
5  1 1 1 1                         
6  1 1 1 1 0                       
7  1 1 1 1 0 0                     
8  1 1 1 1 0 0 0                   
9  1 1 1 1 0 0 0 0                 
10 1 1 1 1 2 2 2 2 2               
11 1 1 1 1 2 2 2 2 2  0            
12 1 1 1 1 2 2 2 2 2  0  0         
13 1 1 1 1 2 2 2 2 2  0  0  0      
14 1 1 1 1 2 2 2 2 2  0  0  0  0   
15 1 1 1 1 2 2 2 2 2  0  0  0  0  0

<code>
 sum(dist(x))
[1] 560
 sum(dist(y$cluster))
[1] 104
> max(sum(dist(y$cluster)))
[1] 104
> min(sum(dist(y$cluster)))
[1] 104
> cindex<-(560-140)/(140-140)
> cindex
[1] Inf
</code>

I got like this.
but when i am using the package of c-inedx in Rstudio it shows like this

<code> 
> z<-intCriteria(x,y$cluster,c("c_index"))
Error in intCriteria(x, y$cluster, c("c_index")) : 
  argument 'traj' must be a matrix
> x<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
> x<-matrix(x,ncol=1)
> y<-kmeans(x,3)
> z<-intCriteria(x,y$cluster,c("c_index"))
> z
$c_index
[1] 0.04489796.
</code>

I want to know whether my above process is right is or wrong? Or please show me how to compute C-Index.

Best Answer

Below is an excerpt from my document on my SPSS macro function computing C-index internal clustering criterion [see my web-page, "Clustering criterions" collection]:

C-Index (Hubert, L.J., Levin, J.R. A general statistical framework for assessing categorical clustering in free recall // Psychol. Bull., 1976, 83, 1072-1080) is a universal criterion like point-biserial correlation. It shows how much the observed within-cluster density – the sum of within-cluster pairwise proximities – is relatively close to such sum utmost maximal (mostly actually unattainable) under the given number of within-cluster proximities.

$(S-S_{min})/(S_{max}-S_{min})$

where S is the summed distance between same-cluster objects (there are fw such distances); Smin is the sum of fw least distances between objects (no matter within-cluster or between-cluster a distance is), Smax is, analogously, the sum of fw greatest distances between objects. C-Index varies between 0 and 1. The lower is the value the better is the cluster partition.

C-Index is linearly identical (but with a negative slope coefficient) to point-biserial correlation (see !RPBCLU macro) rescaled relative its empirical maximum. I.e., if rpb=.57, say, but in the data with the given partition being considered rpb cannot exceed, say, .95, then the rescaled rpb=.57/.95. C-Index criterion is the magnitude which is equivalent to this one; thereby, rescaled correlation and C-Index are the same thing, in contents. C-Index tends to prefer more clusters than rpb.

Like point-biserial correlation, C-Index is insensitive to linear transformation of proximities. Such indifference to proximity size is counter-intuitive in a specific geometrical sense and may sometimes be regarded a drawback.

In euclidean space using euclidean distances the index’s attitude to ellipsoid clusters in comparison with spherical ones is similar to that of point-biserial correlation (see !RPBCLU description): with the increase of space dimensionality there strengthens the preference for spherical clusters. By my tentative impressions in simulation trials, C-Index punishes chains stronger (see picture in !RPBCLU description) than it rewards piles, while point-biserial r, contrary, stronger rewards the second than it punishes the first. And in regard of round clusters, both criterions often give very similar results.

Let me now show how you can compute C-index of a cluster solution. You will need the distance matrix and the result of clustering. I'll use Iris data, 5 cases from each class, clustered by K-means (CLU = cluster code):

  id SLength SWidth PLength PWidth species       CLU

   1    5.1    3.5     1.4     .2  setosa          1
   2    4.9    3.0     1.4     .2  setosa          1
   3    4.7    3.2     1.3     .2  setosa          1
   4    4.6    3.1     1.5     .2  setosa          1
   5    5.0    3.6     1.4     .2  setosa          1
  51    7.0    3.2     4.7    1.4  versicolor      2
  52    6.4    3.2     4.5    1.5  versicolor      3
  53    6.9    3.1     4.9    1.5  versicolor      2
  54    5.5    2.3     4.0    1.3  versicolor      3
  55    6.5    2.8     4.6    1.5  versicolor      3
 101    6.3    3.3     6.0    2.5  virginica       2
 102    5.8    2.7     5.1    1.9  virginica       3
 103    7.1    3.0     5.9    2.1  virginica       2
 104    6.3    2.9     5.6    1.8  virginica       2
 105    6.5    3.0     5.8    2.2  virginica       2

Matrix D of euclidean distances between the 15 cases:

  .000  .539  .510  .648  .141 4.004 3.617 4.164 3.094 3.792 5.285 4.208 5.302 4.690 5.057
  .539  .000  .300  .332  .608 4.096 3.686 4.237 2.970 3.812 5.339 4.181 5.357 4.709 5.091
  .510  .300  .000  .245  .510 4.277 3.850 4.416 3.154 3.997 5.473 4.335 5.529 4.868 5.247
  .648  .332  .245  .000  .648 4.177 3.734 4.306 2.985 3.873 5.336 4.177 5.406 4.722 5.110
  .141  .608  .510  .648  .000 4.061 3.663 4.219 3.148 3.850 5.313 4.246 5.351 4.731 5.096
 4.004 4.096 4.277 4.177 4.061  .000  .640  .265 1.887  .656 1.844 1.449 1.407 1.245 1.463
 3.617 3.686 3.850 3.734 3.663  .640  .000  .648 1.382  .424 1.808 1.063 1.688 1.183 1.493
 4.164 4.237 4.416 4.306 4.219  .265  .648  .000 1.857  .583 1.616 1.253 1.187  .990 1.212
 3.094 2.970 3.154 2.985 3.148 1.887 1.382 1.857  .000 1.285 2.661 1.349 2.702 1.952 2.354
 3.792 3.812 3.997 3.873 3.850  .656  .424  .583 1.285  .000 1.803  .954 1.565 1.068 1.404
 5.285 5.339 5.473 5.336 5.313 1.844 1.808 1.616 2.661 1.803  .000 1.334  .949  .900  .510
 4.208 4.181 4.335 4.177 4.246 1.449 1.063 1.253 1.349  .954 1.334  .000 1.568  .742 1.077
 5.302 5.357 5.529 5.406 5.351 1.407 1.688 1.187 2.702 1.565  .949 1.568  .000  .911  .616
 4.690 4.709 4.868 4.722 4.731 1.245 1.183  .990 1.952 1.068  .900  .742  .911  .000  .500
 5.057 5.091 5.247 5.110 5.096 1.463 1.493 1.212 2.354 1.404  .510 1.077  .616  .500  .000

Calculation of C-Index.

1.
Take column CLU, propagate it horizontally into 15x15 matrix CLUM.
Then compute CLUM = CLUM=t(CLUM) to obtain square symmetric indicator matrix
(1 = in the same cluster; 0 = in different clusters), and set 0 on the diagonal:
CLUM
  0  1  1  1  1  0  0  0  0  0  0  0  0  0  0
  1  0  1  1  1  0  0  0  0  0  0  0  0  0  0
  1  1  0  1  1  0  0  0  0  0  0  0  0  0  0
  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0
  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  1  0  0  1  0  1  1  1
  0  0  0  0  0  0  0  0  1  1  0  1  0  0  0
  0  0  0  0  0  1  0  0  0  0  1  0  1  1  1
  0  0  0  0  0  0  1  0  0  1  0  1  0  0  0
  0  0  0  0  0  0  1  0  1  0  0  1  0  0  0
  0  0  0  0  0  1  0  1  0  0  0  0  1  1  1
  0  0  0  0  0  0  1  0  1  1  0  0  0  0  0
  0  0  0  0  0  1  0  1  0  0  1  0  0  1  1
  0  0  0  0  0  1  0  1  0  0  1  0  1  0  1
  0  0  0  0  0  1  0  1  0  0  1  0  1  1  0

2.
Compute sum of within-cluster distances SW = msum(D &* CLUM)/2
[&* denotes elementwise multiplication, and msum is matrix sum]
Compute number of within-cluster distances NW = msum(CLUM)/2
Compute number of between-cluster distances NB = n*(n-1)/2-NW
[n is the number of objects, size of CLUM, 15]

3.
Unwrap one triangle of D into vector (vectorize) DVEC
Rank distances in DVEC: RDVEC = grade(DVEC)
[grade is ranking by consecutive numbers ignoring ties, e.g. vector 1.2, 2.4, 2.4, 2.9 will turn into ranks 1, 2, 3, 4]

    DVEC        RDVEC
  .5385165    11.0000000
  .5099020     8.0000000
  .6480741    16.0000000
  .1414214     1.0000000
 4.0037482    71.0000000
 3.6166283    61.0000000
 4.1641326    74.0000000
 3.0935417    58.0000000
 3.7920970    65.0000000
 5.2848841    96.0000000
 4.2083251    78.0000000
 5.3018865    97.0000000
 4.6904158    86.0000000
 5.0566788    91.0000000
  .3000000     4.0000000
  .3316625     5.0000000
  .6082763    13.0000000
 4.0963398    73.0000000
 3.6864617    63.0000000
 4.2367440    80.0000000
 2.9698485    56.0000000
 3.8118237    66.0000000
 5.3385391   100.0000000
 4.1809090    77.0000000
 5.3572381   102.0000000
 4.7085029    87.0000000
 5.0911688    92.0000000
  .2449490     2.0000000
  .5099020     9.0000000
 4.2766810    82.0000000
 3.8496753    68.0000000
 4.4158804    85.0000000
 3.1543621    60.0000000
 3.9974992    70.0000000
 5.4726593   104.0000000
 4.3347434    84.0000000
 5.5290144   105.0000000
 4.8682646    90.0000000
 5.2469038    95.0000000
  .6480741    17.0000000
 4.1773197    76.0000000
 3.7336309    64.0000000
 4.3058100    83.0000000
 2.9849623    57.0000000
 3.8729833    69.0000000
 5.3357286    99.0000000
 4.1773197    75.0000000
 5.4064776   103.0000000
 4.7222876    88.0000000
 5.1097945    94.0000000
 4.0607881    72.0000000
 3.6633318    62.0000000
 4.2190046    79.0000000
 3.1480152    59.0000000
 3.8496753    67.0000000
 5.3131911    98.0000000
 4.2461747    81.0000000
 5.3507009   101.0000000
 4.7307505    89.0000000
 5.0960769    93.0000000
  .6403124    15.0000000
  .2645751     3.0000000
 1.8867962    51.0000000
  .6557439    19.0000000
 1.8439089    49.0000000
 1.4491377    40.0000000
 1.4071247    39.0000000
 1.2449900    32.0000000
 1.4628739    41.0000000
  .6480741    18.0000000
 1.3820275    37.0000000
  .4242641     6.0000000
 1.8083141    48.0000000
 1.0630146    26.0000000
 1.6881943    46.0000000
 1.1832160    29.0000000
 1.4933185    42.0000000
 1.8574176    50.0000000
  .5830952    12.0000000
 1.6155494    45.0000000
 1.2529964    33.0000000
 1.1874342    30.0000000
  .9899495    25.0000000
 1.2124356    31.0000000
 1.2845233    34.0000000
 2.6608269    54.0000000
 1.3490738    36.0000000
 2.7018512    55.0000000
 1.9519221    52.0000000
 2.3537205    53.0000000
 1.8027756    47.0000000
  .9539392    24.0000000
 1.5652476    43.0000000
 1.0677078    27.0000000
 1.4035669    38.0000000
 1.3341664    35.0000000
  .9486833    23.0000000
  .9000000    21.0000000
  .5099020    10.0000000
 1.5684387    44.0000000
  .7416198    20.0000000
 1.0770330    28.0000000
  .9110434    22.0000000
  .6164414    14.0000000
  .5000000     7.0000000

4.
Compute sum of within-cluster distances of "utmost best" solution with the given number of within- and between-cluster distances
MINSW = msum(DVEC &* (RDVEC<=NW))  =  21.55926197
[expression RDVEC<=NW is indicator column whether RDVEC value is <=NW or not]
Likewise compute sum of within-cluster distances of "utmost worst" solution with the given number of within- and between-cluster distances
MAXSW = msum(DVEC &* (RDVEC>NB))  =  149.7900756

Of course, you can alternatively compute MINSW and MAXSW after sorting of DVEC, not ranking it.

5.
Compute C-index
CINDEX = (SW-MINSW)/(MAXSW-MINSW)  =  .0389396