I have generated clusters for two different datasets (d1 and d2) with a hierarchical clustering algorithm, and I would like to calculate the similarity between the clusters generated for d1 and those generated for d2.
Example: compare d1_1 to d2_1, where "_x" is the cluster number.
It is important to note that each cluster can have a different number of objects, but all clusters have the same attribute types:
**Attribute name** - **Data type**
Attr1 - nominal
Attr2 - numerical
Attr3 - nominal
Attr4 - nominal
Attr5 - numerical
Example:
Cluster 1 from d1:
- Car, 12, Diesel, Black, 24
- Car, 13, Diesel, Blue, 2200
- Truck, 12, Diesel, Black, 24
- Car, 12, Diesel, Black, 82
Cluster 1 from d2:
- Car, 12, Gasoline, Black, 24
- Truck, 12, Diesel, Black, 24
- Car, 12, Diesel, Black, 82
If possible, I would like a similarity value between two clusters in the range 0 to 1, or a percentage of similarity.
I assume that two clusters are similar if their numerical values are close and their nominal values are equal.
Best Answer
Define a distance function between data points and this becomes easier. Let $F_x(i)$ be the $i$th numerical feature and $D_x(j)$ the $j$th nominal feature (as a one-hot vector) of data point $x$. Then the distance between data points $x$ and $y$ can be, for instance, $$ \delta(x,y) = \sum_i \gamma_i | F_x(i) - F_y(i) |^p + \eta\sum_j || D_x(j) - D_y(j) ||_2^2 $$ where we choose $p,\gamma_i,\eta$ based on the data itself. For instance, we can choose $p=1$, $\eta=1/|D|$ (one over the number of nominal features), and $$ \gamma_i = \frac{|X|}{|F|\sum_{x\in X} |F_x(i)|} $$ as the weight for numerical feature $i$ over the dataset $X$, i.e. one over the mean magnitude of that feature (scaled by the number of numerical features $|F|$), so that the relative contribution of each term is similar in magnitude.
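A minimal sketch of this distance in Python, under assumed names: each point is a dict with a `"num"` list of numerical features and a `"nom"` list of one-hot NumPy vectors, and the weights `gamma` and `eta` are passed in (e.g. computed from the dataset as above).

```python
import numpy as np

def delta(x, y, gamma, eta, p=1):
    """Mixed numeric/nominal distance between two data points.

    x, y: dicts with "num" (list of floats) and "nom" (list of one-hot
    np.ndarray vectors). gamma: per-feature numeric weights. eta: nominal
    weight. These names are illustrative, not from any library.
    """
    # Weighted p-power differences over the numerical features
    num_term = sum(g * abs(fx - fy) ** p
                   for g, fx, fy in zip(gamma, x["num"], y["num"]))
    # Squared Euclidean distance between one-hot vectors per nominal feature
    nom_term = eta * sum(np.sum((dx - dy) ** 2)
                         for dx, dy in zip(x["nom"], y["nom"]))
    return num_term + nom_term
```

For two points whose nominal values differ in one feature, the one-hot difference contributes $2\eta$ to the distance, since $||D_x - D_y||_2^2 = 2$ for two distinct one-hot vectors.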
Then, given two clusters $C_1$ and $C_2$, there are many ways to compute normalized similarity. One is just $$ S(C_1,C_2) = \frac{1}{1+\Delta(C_1,C_2)},\;\;\text{where}\;\; \Delta(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \delta(x,y) $$ so that we get a similarity of $1$ when the clusters are identical and something close to $0$ when they are very different. Another, for instance, is $S_e(C_1,C_2)=\exp(-\Delta(C_1,C_2))$.
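The cluster-level similarity is then a direct translation of the formula: average the pairwise point distances and squash into $(0,1]$. A sketch, assuming clusters are lists of points and `delta` is any point distance function:

```python
import math

def cluster_distance(c1, c2, delta):
    """Delta(C1, C2): mean pairwise distance between two clusters."""
    return sum(delta(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def similarity(c1, c2, delta):
    """S(C1, C2) = 1 / (1 + Delta): 1 for identical, near 0 for distant."""
    return 1.0 / (1.0 + cluster_distance(c1, c2, delta))

def similarity_exp(c1, c2, delta):
    """The alternative S_e(C1, C2) = exp(-Delta)."""
    return math.exp(-cluster_distance(c1, c2, delta))
```

Note that $S(C,C)$ is $1$ only when every pair of points within $C$ is at distance $0$; for a non-degenerate cluster, the self-similarity is slightly below $1$.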
Alternatively, since each $D_x(j)$ is already a one-hot vector, we can "unfold" each data point into a single vector of numbers $\vec{x}$. Then we could compute a similarity via $$ \tau_c(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{||\vec{x}||_2\,||\vec{y}||_2} $$ which measures the cosine of the angle between the vectors in the data space. This is the cosine similarity, so $\tau_c\in[-1,1]$. (Note that no attempt is made to account for differing magnitudes across dimensions.) Then we can measure overall similarity via $$ S_c(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \frac{\tau_c(\vec{x},\vec{y}) + 1}{2} $$ which is $0$ for very different clusters and $1$ for very close ones.
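The cosine variant can be sketched as follows, assuming each point has already been unfolded into a nonzero NumPy vector (numerical features concatenated with the one-hot encodings):

```python
import numpy as np

def cosine_cluster_similarity(C1, C2):
    """S_c(C1, C2): mean of pairwise cosine similarities rescaled to [0, 1].

    C1, C2: lists of nonzero np.ndarray vectors (unfolded data points).
    """
    total = 0.0
    for x in C1:
        for y in C2:
            tau = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
            total += (tau + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
    return total / (len(C1) * len(C2))
```

Since one-hot components are nonnegative, points that agree on every nominal feature already start with positive overlap, so in practice $S_c$ rarely reaches the extremes of its range.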