See, even hierarchical clustering needs parameters if you want to get a partitioning out of it. In fact, hierarchical clustering has (roughly) four parameters:

1. the actual algorithm (divisive vs. agglomerative),
2. the distance function,
3. the linkage criterion (single-link, Ward, etc.), and
4. the distance threshold at which you cut the tree (or any other extraction method).
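To make those four knobs concrete, here is a minimal sketch using SciPy's agglomerative implementation; the toy data and the cut threshold of 5.0 are made-up values for illustration, not anything from the question:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))              # toy data, purely illustrative

D = pdist(X, metric="euclidean")          # 2. the distance function
Z = linkage(D, method="ward")             # 1. agglomerative algorithm + 3. linkage criterion
labels = fcluster(Z, t=5.0, criterion="distance")  # 4. the threshold where the tree is cut
print(np.unique(labels))                  # cluster ids produced by this particular cut
```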
The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique: you have to try different methods and parameters and analyze the results.
I found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold $\varepsilon$, i.e. the neighborhood predicate, and minPts, i.e. the core-point predicate). I'm not counting the distance function separately this time, because what is really needed is a binary "is neighbor of" predicate; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g. when working with geo data, distance is literally in kilometers, which allows me to specify the spatial resolution intuitively.
Similarly, minPts gives me intuitive control over how "significant" a subset of observations needs to be before it counts as a cluster.
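As an illustration of that intuition, here is a hedged sketch of DBSCAN on geographic coordinates using scikit-learn's haversine metric; the coordinates, the 1.5 km radius, and minPts = 3 are invented for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A handful of made-up (lat, lon) points in degrees: two small blobs.
coords_deg = np.array([
    [48.8566, 2.3522], [48.8570, 2.3530], [48.8580, 2.3510],
    [52.5200, 13.4050], [52.5205, 13.4060], [52.5210, 13.4040],
])

kms_per_radian = 6371.0088   # mean Earth radius, converts radians to km
eps_km = 1.5                 # "points within 1.5 km are neighbors" (assumption)
min_pts = 3                  # a cluster needs at least 3 points (assumption)

db = DBSCAN(eps=eps_km / kms_per_radian, min_samples=min_pts,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(coords_deg))  # haversine expects radians
print(labels)                # -1 marks noise, i.e. points in no cluster
```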
Usually, when you find DBSCAN hard to use, it is because you have not yet understood what "distance" means for your data. You then first need to figure out how to measure distance and what the resulting numbers mean to you; then you will know the threshold to use.
And in the end, go and try things out. It's data exploration, not `return(truth);`. There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
> The clustering itself is done using the Euclidean Distance - however the dendrogram is depicted using the squared Euclidean Distance. They don't explain ...
From the look of the dendrogram, one might suppose they used Ward's linkage or something similar. Ward's method optimizes the within-cluster sum of squares ($SS_{within}$), and traditionally the Y axis of a Ward dendrogram shows the pooled $SS_{within}$ or the squared distance; see the link. That link also warns against relying on the look of Ward's dendrogram in this way.
> Does that mean that the subjects in C1b have a different range (bigger) of distances between each other than those in C1a?
Vertical branch length is the leap of "decompression" a group experiences when it gets merged with some other group. The specific meaning of "decompression", however, depends on the linkage method.
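For the record, here is a small sketch of where those merge heights live numerically, assuming SciPy was used to build the tree (the data below is synthetic, purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))      # toy data

Z = linkage(X, method="ward")
# Z[i, 2] is the height (merge criterion value) of the i-th merge;
# the vertical branch length of a group is the gap between the height
# at which it forms and the height at which it is merged away.
print(Z[:, 2])
```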
> Since the CHC is more and more decreasing here, does it make any sense to choose more than 2 clusters?
To me, no. In this particular example it is pretty obvious that the 2-cluster solution is the best, according to the CH criterion. We don't know, however, whether it is any better than the 1-cluster solution (i.e. no clusters); to check for that, I would recommend plotting the data for visual inspection. You might also want to use the Gap criterion which, by means of simulations, can test for the 1-cluster solution.
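As a sketch of that kind of check, assuming scikit-learn and synthetic data (nothing here comes from the study in question), one can scan $k$ and watch the CH score:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)  # toy data

for k in range(2, 8):             # CH is undefined for k = 1
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))
# The score typically peaks at k = 2 here, which mirrors the point above:
# CH alone cannot compare against the 1-cluster "no clusters" case.
```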
On the other hand, it is true that, in the general case, one should also pay attention to potential sharp elbows on such plots, not only to the peaks (or canyons); see the link. This is because clustering criteria (like CH) are difficult to "standardize": they have their own biases, including biases towards particular numbers of clusters $k$. CH, for example, often prefers more clusters than, for instance, the BIC criterion, which penalizes for $k$.
Still, in your current example I cannot come up with a justification for the authors' statement that "we defined stability (i.e. minimal change from one cluster number to the next) as our goal in deciding where to cut the dendrogram" without knowing their context, and the word "stability" looks strange to me here. I see nothing ragged, no elbows, on the plot except the one between 2 and 3.
Internal clustering criteria such as CH are only one of several ways to select $k$ or to validate clustering results.
The cophenetic correlation coefficient is defined as the linear correlation between the dissimilarities $d_{ij}$ between each pair of observations $(i,j)$ and their corresponding cophenetic distances $d_{ij}^{coph}$, where the cophenetic distance is the intergroup dissimilarity at which observations $i$ and $j$ are first merged into the same cluster.
So you get the cophenetic correlation coefficient $CCC$ by calculating the correlation between those two sets of values. Let $D$ be the distance matrix according to $d$ and $Z$ the distance matrix according to $d^{coph}$, and let $\bar{D}$ and $\bar{Z}$ denote the means of the $D_{ij}$ and the $Z_{ij}$, respectively. Then
$CCC(D,Z) = Cor(D,Z) = \frac{\sum\limits_{i<j} (D_{ij} - \bar{D})(Z_{ij} - \bar{Z}) }{\sqrt{\sum\limits_{i<j} (D_{ij} - \bar{D})^2 \sum\limits_{i<j} (Z_{ij} - \bar{Z})^2 }}$
(see: Mathworks Documentation: cophenetic correlation coefficient)
This should be equal to what you have computed.
So, I think your assumption is correct.
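For completeness, a minimal sketch of the same computation with SciPy's built-in `cophenet`, on random data; the manual Pearson correlation at the end mirrors the formula above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))      # toy data

D = pdist(X)                      # condensed d_ij
Z = linkage(D, method="average")
ccc, D_coph = cophenet(Z, D)      # CCC and condensed cophenetic distances

# Equivalent by hand: Pearson correlation of the two condensed matrices.
ccc_manual = np.corrcoef(D, D_coph)[0, 1]
print(ccc, ccc_manual)            # the two values agree
```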