R Clustering – Updates on Cubic Clustering Criterion Using R


I have seen other users ask about recreating SAS's CCC output in other programs. This question, Cubic clustering criterion in R, has an answer that says to use NbClust to calculate it, but that function does not handle large datasets well: it makes a call to dist that must allocate a ~50 GB object. I have tried replacing that call with cluster::daisy, and with proxy::dist from this SO question, and hit the same memory problems.
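(For scale: dist on $n$ rows stores $n(n-1)/2$ pairwise distances as 8-byte doubles, roughly $4n^2$ bytes, so a dataset of around 115,000 rows already needs a ~50 GB distance object.)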

Avoiding the dist call altogether may be the best option, so I am looking for other ways to recreate the criterion. In this question, How to define number of clusters in K-means clustering?, a user walks through the math provided by SAS, but I do not have the stats chops to translate that into R code.

Keeping it simple, I have kmeans output that provides the total sum of squares (totss), tot.withinss, and betweenss, and I have also calculated the $R^2$.

kmeans(x = mydata, centers = 23, iter.max = ITER)
Within cluster sum of squares by cluster:
 [1]  91248.77  72122.06  78680.32  90402.25  86341.35 153533.51  73988.63  64903.32
 [9]  38334.98  84125.14  92366.93  74721.24 110313.76  96859.55  84516.37  56068.08
[17]  76201.69  86194.35  59526.00  53709.75  72503.21  50767.36  80531.94
 (between_SS / total_SS =  36.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"

Can I calculate the CCC using these measures?


The second question has a long description taken from the SAS PDF, but I saw a simplified equation here:

$$\mathrm{CCC} = \ln\!\left(\frac{1 - E(R^2)}{1 - R^2}\right) \cdot K$$

where $E(R^2)$ is the expected $R^2$, $R^2$ is the observed $R^2$, and $K$ is the variance-stabilizing transformation.
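
Matching the term that appears in the accepted answer's code below, $K$ works out to

$$K = \frac{\sqrt{n\,p^{*}/2}}{\left(0.001 + E(R^2)\right)^{1.2}},$$

where $n$ is the number of observations and $p^{*}$ is the effective dimensionality (p1 in the code).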

Can this equation be computed from R's kmeans output and a calculated $R^2$?

Edit

One reason why I am focusing on kmeans is that SAS users run PROC FASTCLUS on large datasets; it is the equivalent of R's kmeans function. The package NbClust calculates the CCC I'm looking for, but it does so on the full Euclidean distance matrix, which is infeasible on most computers. That is the equivalent of SAS's PROC CLUSTER.

Best Answer

I figured it out. The solution I chose was to extract what NbClust was doing and rewrite it, excluding the dist matrix call and everything else I did not need. I checked my custom CCC function against NbClust's actual output to be sure the results are the same:

> NbClust(iris[-5], min.nc=3, max.nc=3, index="ccc", method="kmeans")
$All.index
[1] 37.6701

$Best.nc
Number_clusters     Value_Index 
         3.0000         37.6701 


> CCC(iris[-5], 3)
[1] 37.67012

The measures are the same.

Here's the full function for anyone else interested in CCC in R.

As some may notice, this function also calculates other diagnostics, like c("scott", "rubin", "marriot", "friedman"). I only needed the CCC for my purposes, but the others can also be extracted (see the note after the function):

CCC <- function(data, nc) {
  # Core computation, extracted from NbClust's internal Indices.WBT()
  Indices.WBT <- function(x, cl, P, s, vv)
{
  n <- dim(x)[1]    # number of observations
  pp <- dim(x)[2]   # number of variables
  qq <- max(cl)     # number of clusters
  clX <- as.matrix(cl)

  # n x qq indicator matrix: z[i,j] = 1 if observation i is in cluster j
  z <- matrix(0, ncol = qq, nrow = n)
  for (i in 1:n)
    for (j in 1:qq)
    {
      if (clX[i, 1] == j)
        z[i, j] <- 1   # the original `z[i,j]==0` was a no-op comparison; z starts at 0
    }

  xbar <- solve(t(z)%*%z)%*%t(z)%*%x   # qq x pp matrix of cluster means
  B <- t(xbar)%*%t(z)%*%z%*%xbar       # between-cluster SSCP matrix
  W <- P - B                           # within-cluster SSCP matrix
  marriot <- (qq^2)*det(W)
  trcovw <- sum(diag(cov(W)))
  tracew <- sum(diag(W))
  if (det(W) != 0)
    scott <- n*log(det(P)/det(W))
  else {cat("Error: division by zero!"); scott <- NA}   # avoid an undefined `scott` below
  friedman <- sum(diag(solve(W) %*% B))  # trace(W^-1 B); the original's solve(W)*B multiplied elementwise
  rubin <- sum(diag(P))/sum(diag(W))


  R2 <- 1 - sum(diag(W))/sum(diag(P))   # observed R^2
  v1 <- 1
  c <- (vv/qq)^(1/pp)
  u <- s/c                              # scaled singular values
  k1 <- sum(u >= 1)
  p1 <- min(k1, qq - 1)                 # effective dimensionality p* in the SAS formula
  if (all(p1 > 0, p1 < pp))
  {
    for (i in 1:p1)
      v1 <- v1*s[i]
    c <- (v1/qq)^(1/p1)
    u <- s/c
    b1 <- sum(1/(n + u[1:p1]))
    b2 <- sum(u[(p1+1):pp]^2/(n + u[(p1+1):pp]), na.rm = TRUE)  # original's u[p1+1:pp] relied on NA removal
    E_R2 <- 1 - ((b1 + b2)/sum(u^2))*((n - qq)^2/n)*(1 + 4/n)   # expected R^2
    ccc <- log((1 - E_R2)/(1 - R2))*(sqrt(n*p1/2)/((0.001 + E_R2)^1.2))
  } else
  {
    b1 <- sum(1/(n + u))
    E_R2 <- 1 - (b1/sum(u^2))*((n - qq)^2/n)*(1 + 4/n)
    ccc <- log((1 - E_R2)/(1 - R2))*(sqrt(n*pp/2)/((0.001 + E_R2)^1.2))
  }
  results <- list(ccc=ccc,scott=scott,marriot=marriot,trcovw=trcovw,tracew=tracew,friedman=friedman,rubin=rubin)
  return(results)
}


jeu1 <- as.matrix(data)
numberObsBefore <- dim(jeu1)[1]
jeu <- na.omit(jeu1)                 # drop incomplete cases
nn <- numberObsAfter <- dim(jeu)[1]
pp <- dim(jeu)[2]
TT <- t(jeu) %*% jeu                 # total SSCP matrix

eigenValues <- eigen(TT/(nn - 1))$values
sizeEigenTT <- length(eigenValues)

# A negative eigenvalue means the TSS matrix is indefinite,
# usually because too many observations had missing values
for (i in 1:sizeEigenTT)
{
  if (eigenValues[i] < 0) {
    stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
  }
}

s1 <- sqrt(eigenValues)              # singular values
ss <- ifelse(s1 != 0, s1, 1)         # replace zeros so the product below stays nonzero
vv <- prod(ss)

set.seed(1)                          # kmeans uses random starts; fix the seed for reproducibility
cl1 <- kmeans(jeu, nc)$cluster
Indices.WBT(x = jeu, cl = cl1, P = TT, s = ss, vv = vv)$ccc
}
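
A quick usage sketch (the value matches the NbClust check above). To get the other indices rather than just the CCC, return the full Indices.WBT() list instead of $ccc on the function's last line:

CCC(iris[-5], 3)
# [1] 37.67012
sapply(2:6, function(k) CCC(iris[-5], k))   # scan a range of cluster counts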