[Math] How to calculate similarity between two clusters

clustering, machine learning, statistics

I have generated clusters for two different datasets (d1 and d2) with a hierarchical clustering algorithm, and I would like to calculate the similarity between the clusters generated for d1 and those generated for d2.
Example: compare d1_1 to d2_1, where "_x" is the cluster number.

It is important to note that each cluster can have a different number of objects, but all clusters share the same attribute types:

| Attribute name | Data type |
|----------------|-----------|
| Attr1          | nominal   |
| Attr2          | numerical |
| Attr3          | nominal   |
| Attr4          | nominal   |
| Attr5          | numerical |

Example:

Cluster 1 from d1:

  • Car, 12, Diesel, Black, 24
  • Car, 13, Diesel, Blue, 2200
  • Truck, 12, Diesel, Black, 24
  • Car, 12, Diesel, Black, 82

Cluster 1 from d2:

  • Car, 12, Gasoline, Black, 24
  • Truck, 12, Diesel, Black, 24
  • Car, 12, Diesel, Black, 82

If possible, I would like a similarity value between two clusters in the range 0 to 1, or a percentage of similarity.

I assume that two clusters are similar if they have close values for numerical attributes and equal values for nominal attributes.

Best Answer

Define a distance function between data points and this becomes easier. Let $F_x(i)$ be the $i$th numerical feature and $D_x(j)$ be the $j$th nominal feature (as a one-hot vector) of data point $x$. Then the distance between data points $x$ and $y$ can be, for instance, $$ \delta(x,y) = \sum_i \gamma_i | F_x(i) - F_y(i) |^p + \eta\sum_j || D_x(j) - D_y(j) ||_2^2 $$ where we can choose $p,\gamma_i,\eta$ based on the data itself. For instance, we can choose $p=1$, $\eta=1/|D|$ as one over the number of nominal features, and $$ \gamma_i = \frac{1}{|F|\,|X|}\sum_{x\in X} F_x(i) $$ as the weight for numerical feature $i$, for the dataset $X$, so that the relative contribution of each term is similar in magnitude.
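As a minimal sketch, the distance above might look like the following in Python/NumPy, assuming each data point has already been split into an array of numerical features and a list of one-hot encoded nominal vectors; the function names (`point_distance`, `default_weights`) and the data layout are illustrative, not part of the answer.

```python
import numpy as np

def point_distance(x_num, x_nom, y_num, y_nom, gamma, eta, p=1):
    """delta(x, y): weighted numerical differences plus one-hot nominal differences."""
    num_term = np.sum(gamma * np.abs(np.asarray(x_num, dtype=float) - np.asarray(y_num, dtype=float)) ** p)
    nom_term = eta * sum(np.sum((np.asarray(dx, dtype=float) - np.asarray(dy, dtype=float)) ** 2)
                         for dx, dy in zip(x_nom, y_nom))
    return num_term + nom_term

def default_weights(X_num, n_nominal):
    """Weight choices given in the answer: eta = 1/|D| and
    gamma_i = (1 / (|F| |X|)) * sum over x of F_x(i)."""
    X_num = np.asarray(X_num, dtype=float)
    n_points, n_features = X_num.shape
    gamma = X_num.sum(axis=0) / (n_features * n_points)
    eta = 1.0 / n_nominal
    return gamma, eta
```

For the question's data, `x_num` would hold Attr2 and Attr5, and `x_nom` would hold one-hot vectors for Attr1, Attr3, and Attr4 (e.g., with fuel types {Diesel, Gasoline}, Diesel becomes [1, 0] and Gasoline becomes [0, 1]).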

Then, given two clusters $C_1$ and $C_2$, there are many ways to compute normalized similarity. One is just $$ S(C_1,C_2) = \frac{1}{1+\Delta(C_1,C_2)},\;\;\text{where}\;\; \Delta(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \delta(x,y) $$ so that we get a similarity of $1$ when the clusters are identical and something close to $0$ when they are very different. Another, for instance, is $S_e(C_1,C_2)=\exp(-\Delta(C_1,C_2))$.
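A sketch of the cluster-level similarity, assuming each cluster is a list of data points and `delta` is any pairwise distance function (for example a small wrapper around the hypothetical `point_distance` above with fixed `gamma` and `eta`):

```python
import math

def mean_pairwise_distance(cluster1, cluster2, delta):
    """Delta(C1, C2): mean of delta(x, y) over all cross-cluster pairs."""
    total = sum(delta(x, y) for x in cluster1 for y in cluster2)
    return total / (len(cluster1) * len(cluster2))

def cluster_similarity(cluster1, cluster2, delta):
    """S(C1, C2) = 1 / (1 + Delta): 1 when Delta = 0, close to 0 for very different clusters."""
    return 1.0 / (1.0 + mean_pairwise_distance(cluster1, cluster2, delta))

def cluster_similarity_exp(cluster1, cluster2, delta):
    """Variant S_e(C1, C2) = exp(-Delta)."""
    return math.exp(-mean_pairwise_distance(cluster1, cluster2, delta))
```

If each point is stored as a pair `(x_num, x_nom)`, a suitable `delta` would be `lambda x, y: point_distance(x[0], x[1], y[0], y[1], gamma, eta)`.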


Alternatively, we could concatenate the numerical features $F_x(i)$ and the one-hot vectors $D_x(\ell)$, "unfolding" each data point into a single vector of numbers $\vec{x}$. Then we could compute a similarity via $$ \tau_c(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{||\vec{x}||_2\,||\vec{y}||_2} $$ which measures the angle between the unitized vectors in the data space. This is the cosine similarity, so $\tau_c\in[-1,1]$. (Note that this makes no attempt to account for differences in magnitude across dimensions.) Then we can measure overall similarity via $$ S_c(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \frac{\tau_c(\vec{x},\vec{y}) + 1}{2} $$ which is $0$ for very different clusters and $1$ for very close ones.
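A sketch of this unfolding-plus-cosine approach, again under the assumption (not stated in the answer) that each point is stored as a pair of numerical values and one-hot nominal vectors:

```python
import numpy as np

def unfold(x_num, x_nom):
    """Concatenate numerical features and one-hot nominal vectors into a single vector."""
    parts = [np.asarray(x_num, dtype=float)] + [np.asarray(d, dtype=float) for d in x_nom]
    return np.concatenate(parts)

def cosine_similarity(u, v):
    """tau_c(x, y) in [-1, 1]."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_cluster_similarity(cluster1, cluster2):
    """S_c(C1, C2): mean of (tau_c + 1) / 2 over all cross-cluster pairs of unfolded vectors."""
    vecs1 = [unfold(x_num, x_nom) for x_num, x_nom in cluster1]
    vecs2 = [unfold(y_num, y_nom) for y_num, y_nom in cluster2]
    total = sum((cosine_similarity(u, v) + 1.0) / 2.0 for u in vecs1 for v in vecs2)
    return total / (len(vecs1) * len(vecs2))
```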
