I have seen other users ask about recreating SAS's CCC output in other programs. This question, Cubic clustering criterion in R, has an answer that says to use `NbClust` to calculate it, but that function does not handle large datasets well. It makes a call to `dist` that must allocate a 50 GB object. I have tried replacing that call with `cluster::daisy` and with `proxy::dist` (from this SO question), and I ran into the same memory problems.
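To see why the `dist` call is the bottleneck, recall that a `dist` object stores the lower triangle of the distance matrix, that is, n(n-1)/2 double-precision values at 8 bytes each. The helper below (`dist_gb` is a name made up for this illustration) gives a rough size estimate and ignores any additional temporary copies R may make.

```r
# Rough size in GB of the object returned by stats::dist() for n rows:
# n * (n - 1) / 2 doubles, 8 bytes each. Overhead and copies are ignored.
dist_gb <- function(n) {
  n <- as.numeric(n)  # avoid integer overflow for large n
  n * (n - 1) / 2 * 8 / 1e9
}

dist_gb(1e5)  # roughly 40 GB for 100,000 rows
```

The size grows quadratically with the number of rows, so swapping in a different distance function does not help as long as the full pairwise matrix is still built.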
Avoiding the `dist` call altogether may be the best option, so I am looking at other ways to recreate the CCC. In this question, How to define number of clusters in K-means clustering?, a user goes through the math provided by SAS, but I do not have the stats chops to translate that into R code.
Keeping it simple, I have `kmeans` output that provides the total sum of squares (`totss`), the within-cluster sum of squares (`withinss`), and the between-cluster sum of squares (`betweenss`), and I also calculated the $R^2$.
kmeans(x = mydata, centers = 23, iter.max = ITER)
Within cluster sum of squares by cluster:
[1] 91248.77 72122.06 78680.32 90402.25 86341.35 153533.51 73988.63 64903.32
[9] 38334.98 84125.14 92366.93 74721.24 110313.76 96859.55 84516.37 56068.08
[17] 76201.69 86194.35 59526.00 53709.75 72503.21 50767.36 80531.94
(between_SS / total_SS = 36.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
Can I calculate the CCC using these measures?
The second question has a long description from the SAS PDF, but I saw a simplified equation here:

$$CCC = \ln\left[\frac{1 - E(R^2)}{1 - R^2}\right] \times K$$

where $E(R^2)$ is the expected $R^2$, $R^2$ is the observed $R^2$, and $K$ is the variance-stabilizing transformation.
Can this equation be completed using R's `kmeans` output and a calculated $R^2$?
Edit
One reason I am focusing on `kmeans` is that SAS users rely on `PROC FASTCLUS` when running large datasets. It is equivalent to R's `kmeans` function. The `NbClust` package calculates the CCC that I'm looking for, but it does so on the full data with Euclidean distance, which is infeasible on most computers. That is equivalent to SAS's `PROC CLUSTER`.
Best Answer
I figured it out. The solution I chose was to extract and rewrite what `NbClust` was doing, but to exclude the `dist` matrix call and everything else that I did not need. I checked my custom `CCC` function against the actual `NbClust` output to be sure that the results match. The measures are the same.
Here's the full function for anyone else interested in `CCC` in R. As some may notice, this function also calculates other diagnostics like `c("scott", "rubin", "marriot", "friedman")`. I only needed the CCC for my purposes, but the others can also be extracted.
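For readers who only need the CCC, here is a minimal sketch of how it might be computed from `kmeans` output without any n-by-n distance matrix. It is not the function described above: `ccc_from_kmeans` is a hypothetical name, the steps loosely follow the published SAS formula (observed $R^2$, an expected $R^2$ built from the eigenvalues of the p-by-p covariance matrix, and the variance-stabilizing factor), and it has not been validated against SAS or `NbClust`, so small numerical differences are possible. It also assumes the covariance matrix has full rank.

```r
# Hypothetical helper: CCC from a kmeans fit. Only a p x p covariance
# matrix is needed, never an n x n distance matrix.
ccc_from_kmeans <- function(x, fit) {
  x <- as.matrix(x)
  n <- nrow(x)           # number of observations
  p <- ncol(x)           # number of variables
  q <- length(fit$size)  # number of clusters

  # Observed R^2 from the kmeans components
  r2 <- fit$betweenss / fit$totss

  # Square roots of the eigenvalues of the covariance matrix (descending)
  ev <- eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values
  s <- sqrt(pmax(ev, 0))
  stopifnot(all(s > 0))  # assumption: full-rank covariance matrix

  # First pass: scale the edge lengths and count dimensions with u >= 1
  cc <- (prod(s) / q)^(1 / p)
  u <- s / cc
  p_star <- min(sum(u >= 1), q - 1)

  if (p_star > 0 && p_star < p) {
    # Second pass: rescale using only the first p_star dimensions
    idx <- seq_len(p_star)
    cc <- (prod(s[idx]) / q)^(1 / p_star)
    u <- s / cc
    b <- sum(1 / (n + u[idx])) + sum(u[-idx]^2 / (n + u[-idx]))
  } else {
    # Fallback: treat all p dimensions alike
    p_star <- p
    b <- sum(1 / (n + u))
  }

  # Expected R^2 under the uniform null, then the CCC itself
  e_r2 <- 1 - (b / sum(u^2)) * ((n - q)^2 / n) * (1 + 4 / n)
  log((1 - e_r2) / (1 - r2)) * sqrt(n * p_star / 2) / (0.001 + e_r2)^1.2
}

# Small simulated example: three separated clusters in two dimensions
set.seed(1)
x <- rbind(
  cbind(rnorm(100, 0),  rnorm(100, 0)),
  cbind(rnorm(100, 6),  rnorm(100, 6)),
  cbind(rnorm(100, 12), rnorm(100, 0))
)
fit <- kmeans(x, centers = 3, nstart = 10)
ccc_val <- ccc_from_kmeans(x, fit)
```

In practice one would evaluate this over a range of cluster counts and compare the resulting CCC values, the same way the SAS output is typically read.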