Solved – On cophenetic correlation for dendrogram clustering

classification, clustering

Consider the context of a dendrogram clustering. Let us call the distances between the individuals the original dissimilarities. After constructing the dendrogram, we define the cophenetic dissimilarity between two individuals as the distance between the two clusters that are merged when these individuals are first placed in the same cluster.

Some people consider that the correlation between the original dissimilarities and the cophenetic dissimilarities (called cophenetic correlation) is a "suitability index" of the classification. This sounds totally puzzling to me. My objection does not rely on the particular choice of the Pearson correlation, but on the general idea that any link between the original dissimilarities and the cophenetic dissimilarities could be related to the suitability of the classification.

Do you agree with me, or could you present some argument supporting the use of the cophenetic correlation as a suitability index for the dendrogram classification?

Best Answer

... is a "suitability index" of the classification

It is not quite clear to me what is meant by that. The way I understand it,

the correlation between the original dissimilarities and the cophenetic dissimilarities (called cophenetic correlation)

is a measure of how hierarchically structured the observations, i.e. their distances, are. That is to say, in a clear hierarchy the dissimilarities from an observation to all observations in a different cluster are similar to one another. Consider two datasets A and B clustered using Euclidean distance and complete linkage:

[Figure: datasets A and B]

Even without looking at the cophenetic distance map or computing the cophenetic correlation, one can see that the cophenetic correlation of A is higher than that of B. A hierarchy has levels, so the CC tells us whether the distances to observations on the same level (cluster) are similar.

For the sake of completeness: The cophenetic correlations are CC(A) = 0.936 and CC(B) = 0.691
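As a small illustration (a made-up example, not one of the datasets A and B), take three points on a line at positions 0, 1 and 5 and cluster them with complete linkage. Points 1 and 2 merge at height 1, and the resulting cluster merges with point 3 at height $\max(5, 4) = 5$. The original and cophenetic dissimilarities are then

$d_{12} = 1,\quad d_{13} = 5,\quad d_{23} = 4 \qquad\text{and}\qquad d^{coph}_{12} = 1,\quad d^{coph}_{13} = 5,\quad d^{coph}_{23} = 5$

so their Pearson correlation is

$CC = \frac{84/9}{\sqrt{(78/9)(96/9)}} = \frac{84}{\sqrt{7488}} \approx 0.971$

The only pair for which the dendrogram distorts the original dissimilarity is $(2,3)$, which is reported at the merge height 5 instead of 4; the more such distortions there are, the lower the CC.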

Solved – Detecting patterns of cheating on a multi-question exam

Ad hoc approach

I'd assume that $\beta_i$ is reasonably reliable because it was estimated on many students, most of whom did not cheat on question $i$. For each student $j$, sort the questions in order of increasing difficulty, compute $\beta_i + q_j$ (note that $q_j$ is just a constant offset) and threshold it at some reasonable place (e.g. p(correct) < 0.6). This gives a set of questions which the student is unlikely to answer correctly. You can now use hypothesis testing to see whether this is violated, in which case the student probably cheated (assuming, of course, that your model is correct). One caveat is that if there are few such questions, you might not have enough data for the test to be reliable. Also, I don't think it's possible to determine which question the student cheated on, because he always has a 50% chance of guessing. But if you assume in addition that many students got access to (and cheated on) the same set of questions, you can compare these across students and see which questions were answered correctly more often than chance.

You can do a similar trick with questions, i.e. for each question, sort students by $q_j$, add $\beta_i$ (this is now a constant offset) and threshold at probability 0.6. This gives you a list of students who shouldn't be able to answer this question correctly, so each of them has at most a 60% chance of getting it right. Again, do hypothesis testing and see whether this is violated. This only works if most students cheated on the same set of questions (e.g. if a subset of questions 'leaked' before the exam).
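The per-student version of this test could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the fitted model is $p(\text{correct}) = 1 / (1 + e^{-(\beta_i + q_j)})$, it uses a one-sided exact binomial test with the threshold itself as a conservative success probability, and the function names and toy numbers are hypothetical.

```python
from math import comb, exp


def p_correct(beta_i, q_j):
    """Assumed logistic model: P(student j answers question i correctly)."""
    return 1.0 / (1.0 + exp(-(beta_i + q_j)))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in range(k, n + 1))


def cheating_p_value(answers, betas, q_j, threshold=0.6):
    """One-sided p-value for 'too many unlikely questions answered correctly'.

    answers: 0/1 per question for one student; betas: fitted question parameters.
    Returns None if no question falls below the threshold.
    """
    hard = [i for i, b in enumerate(betas) if p_correct(b, q_j) < threshold]
    if not hard:
        return None
    k = sum(answers[i] for i in hard)
    # Every selected question has p(correct) < threshold, so using the
    # threshold as the success probability makes the test conservative.
    return binom_upper_tail(k, len(hard), threshold)


# Hypothetical toy numbers: two easy questions, ten hard ones, ability q_j = 0.
betas = [2.0, 1.0] + [-1.0] * 10
suspicious = cheating_p_value([1] * 12, betas, q_j=0.0)
ordinary = cheating_p_value([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0], betas, q_j=0.0)
print(suspicious, ordinary)
```

A small p-value only says that the answers are surprising under the fitted model; as noted above, with few questions below the threshold the test has little power. The per-question version is the same computation with the roles of students and questions swapped.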

Principled approach

For each student, there is a binary variable $c_j$ with a Bernoulli prior with some suitable probability, indicating whether the student is a cheater. For each question there is a binary variable $l_i$, again with some suitable Bernoulli prior, indicating whether the question was leaked. Then there is a set of binary variables $a_{ij}$, indicating whether student $j$ answered question $i$ correctly. If $c_j = 1$ and $l_i = 1$, then the distribution of $a_{ij}$ is Bernoulli with probability 0.99. Otherwise $a_{ij}$ is Bernoulli with probability $\mathrm{logit}^{-1}(\beta_i + q_j)$. These $a_{ij}$ are the observed variables; $c_j$ and $l_i$ are hidden and must be inferred. You can probably do this by Gibbs sampling, but other approaches might also be feasible, maybe something related to biclustering.
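Written out, the model and the conditionals a Gibbs sampler would need might look like this. It is a sketch only: the prior probabilities $\pi_c$ and $\pi_l$ are names introduced here for the "suitable" Bernoulli priors, and $\beta_i$ and $q_j$ are treated as fixed, already estimated quantities.

$c_j \sim \mathrm{Bernoulli}(\pi_c), \qquad l_i \sim \mathrm{Bernoulli}(\pi_l), \qquad p_{ij} = \frac{1}{1 + e^{-(\beta_i + q_j)}}$

$a_{ij} \mid c_j, l_i \sim \mathrm{Bernoulli}(\theta_{ij}), \qquad \theta_{ij} = \begin{cases} 0.99 & \text{if } c_j = 1 \text{ and } l_i = 1 \\ p_{ij} & \text{otherwise} \end{cases}$

Given the current values of all $l_i$, only the leaked questions distinguish a cheater from a non-cheater, so the conditional odds for student $j$ are

$\frac{P(c_j = 1 \mid a, l)}{P(c_j = 0 \mid a, l)} = \frac{\pi_c}{1 - \pi_c} \prod_{i:\, l_i = 1} \frac{0.99^{a_{ij}} \, 0.01^{1 - a_{ij}}}{p_{ij}^{a_{ij}} \, (1 - p_{ij})^{1 - a_{ij}}}$

and symmetrically for question $i$, given the current values of all $c_j$,

$\frac{P(l_i = 1 \mid a, c)}{P(l_i = 0 \mid a, c)} = \frac{\pi_l}{1 - \pi_l} \prod_{j:\, c_j = 1} \frac{0.99^{a_{ij}} \, 0.01^{1 - a_{ij}}}{p_{ij}^{a_{ij}} \, (1 - p_{ij})^{1 - a_{ij}}}$

A Gibbs sweep would alternate between resampling every $c_j$ and every $l_i$ from these conditionals; the fraction of sweeps in which $c_j = 1$ then estimates the posterior probability that student $j$ cheated.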

Solved – Validate dendrogram in cluster analysis: What is the meaning of cophenetic correlation coefficient

The cophenetic correlation coefficient is defined as the linear correlation between the dissimilarities $d_{ij}$ between each pair of observations $(i,j)$ and their corresponding cophenetic distances $d_{ij}^{coph}$, where $d_{ij}^{coph}$ is the intergroup dissimilarity at which observations $i$ and $j$ are first merged into the same cluster.

So you get the cophenetic correlation coefficient $CCC$ by calculating the correlation between those values. Let $D$ be the distance matrix according to $d$ and $Z$ be the distance matrix according to $d^{coph}$, and let $\bar{D}$ and $\bar{Z}$ denote the means of $d_{ij}$ and $d_{ij}^{coph}$ respectively. Then

$CCC(D,Z) = Cor(D,Z) = \frac{\sum\limits_{i<j} (D_{ij} - \bar{D})(Z_{ij} - \bar{Z}) }{\sqrt{\sum\limits_{i<j} (D_{ij} - \bar{D})^2 \sum\limits_{i<j} (Z_{ij} - \bar{Z})^2 }}$

(see: Mathworks Documentation: cophenetic correlation coefficient)

This should be equal to what you have done by calculating

cor(euclidian_dist, coph)

So, I think your assumption is correct.
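To make the definition concrete, here is a minimal pure-Python sketch that builds the cophenetic distances for a tiny dataset and evaluates the $CCC$ formula directly. It assumes Euclidean distance and complete linkage, uses a naive clustering loop that is only suitable for a handful of points, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def pairwise_and_cophenetic(points):
    """Return (d, coph): original and cophenetic distances keyed by (i, j), i < j.

    Assumes Euclidean distance and complete linkage (a choice made for this sketch).
    """
    n = len(points)
    d = {(i, j): dist(points[i], points[j]) for i, j in combinations(range(n), 2)}

    def linkage(a, b):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(d[min(i, j), max(i, j)] for i in a for j in b)

    clusters = [[i] for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(a, b)
        # Every pair split between a and b is first joined at this height.
        for i in a:
            for j in b:
                coph[min(i, j), max(i, j)] = height
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return d, coph


def cophenetic_correlation(points):
    """Pearson correlation between original and cophenetic distances over i < j."""
    d, coph = pairwise_and_cophenetic(points)
    pairs = sorted(d)
    x = [d[p] for p in pairs]
    z = [coph[p] for p in pairs]
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    num = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - mz) ** 2 for b in z))
    return num / den


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
ccc = cophenetic_correlation(toy_points)
print(round(ccc, 3))
```

With a different linkage method the merge heights, and therefore the cophenetic distances and the coefficient, will generally differ, so the linkage should be reported alongside the value.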
