Solved – distance between 2 clusters

distance

I have the formula and an example for calculating the distance between 2 clusters below,

I couldn't figure out why and how {2,4,8} can become 8 and {-5,0,5} can become -5. Can someone please enlighten me please?

Best Answer

$D$ is a linkage criteria, it is known as complete linkage. Goto Wikipedia:

https://en.wikipedia.org/wiki/Hierarchical_clustering#Metric

and you should see this:

Does the formula look similar to yours?

The distance between the two clusters is the maximum between the two clusters. Obviously, $8$ and $-5$ are the furthest in your scenario. For instance, $D(8,0)$ = |8-0| = only 8.

Related Solutions

Solved – How is the distance formula related to the formula for standard deviation

Any set in which you can define a 'distance' function which satisfies a few properties (distances are positive, symmetric, and additive). Is called a Metric space. $\mathbb{R}^k$ is a metric space with the distance function typically defined to be $d(\mathbf{x},\mathbf{y}) = |\mathbf{x}-\mathbf{y}|$, the norm of the difference (although we can use whatever distance function we want as long as it satisfies the 3 properties, more on that later).

The norm is defined to be $|\mathbf{x}| = \sqrt{\sum_{i=1}^n x_i^2}$. That right there looks strangely familiar you might think. So if you have some observed values $\mathbf{x}=x_1,\ldots,x_n$ and if we find the distance between your observed values and their mean, $\mu$ we have $d(\mathbf{x},\mu) = |\mathbf{x}-\mu| = \sqrt{\sum_{i=1}^n (x_i-\mu)^2}$ which is almost like the standard deviation (missing a $1/n$ or $1/(n-1)$. However, we can easily redefine our distance function to be something like $d(\mathbf{x},\mathbf{y}) = +\sqrt{1/n}|\mathbf{x}-\mathbf{y}|$ and it will still have the three properties required to make $\mathbb{R}^k$ a metric space.

You might be more familiar with distances in a 2-dimensional space like $\mathbb{R}^2$. In this space we can use the same distance function as above, but since instead of $k$ components we have only 2 the formula simplifies to $d((x_1,y_1), (x_2,y_2)) = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$

Clustering – Why Minimizing WCSS Maximizes Distance Between Clusters in K-means

K-means is all about the analysis-of-variance paradigm. ANOVA - both uni- and multivariate - is based on the fact that the sum of squared deviations about the grand centroid is comprised of such scatter about the group centroids and the scatter of those centroids about the grand one: SStotal=SSwithin+SSbetween. So, if SSwithin is minimized then SSbetween is maximized.

SS of deviations of some points about their centroid (arithmetic mean) is known to be directly related to the overall squared euclidean distance between the points: the sum of squared deviations from centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. (This is the direct extension of the trigonometric property of centroid. And this relation is exploited also in the double centering of distance matrix.)

Thus, saying "SSbetween for centroids (as points) is maximized" is alias to say "the (weighted) set of squared distances between the centroids is maximized".

Note: in SSbetween each centroid is weighted by the number of points Ni in that cluster i. That is, each centroid is counted Ni times. For example, with two centroids in the data, 1 and 2, SSbetween = N1*D1^2+N2*D2^2 where D1 and D2 are the deviations of the centroids from the grand mean. That where word "weighted" in the former paragraph stems from.

Example

Data (N=6: N1=3, N2=2, N3=1)
        V1            V2       Group
       2.06          7.73          1
        .67          5.27          1
       6.62          9.36          1
       3.16          5.23          2
       7.66          1.27          2
       5.59          9.83          3

SSdeviations
        V1            V2            Overall
SSt   37.82993333 + 51.24408333 = 89.07401666
SSw   29.50106667 + 16.31966667 = 45.82073333
SSb   8.328866667 + 34.92441667 = 43.25328333

SSt is directly related to the squared Euclidean distances between the data points:

Matrix of squared Euclidean distances
     .00000000    7.98370000   23.45050000    7.46000000   73.09160000   16.87090000 
    7.98370000     .00000000   52.13060000    6.20170000   64.86010000   45.00000000 
   23.45050000   52.13060000     .00000000   29.02850000   66.52970000    1.28180000 
    7.46000000    6.20170000   29.02850000     .00000000   35.93160000   27.06490000 
   73.09160000   64.86010000   66.52970000   35.93160000     .00000000   77.55850000 
   16.87090000   45.00000000    1.28180000   27.06490000   77.55850000     .00000000 

Its sum/2, the sum of the distances = 534.4441000
534.4441000 / N = 89.07401666 = SSt

The same reasoning holds for SSb.

Matrix of squared Euclidean distances between the 3 group centroids (see https://stats.stackexchange.com/q/148847/3277)
     .00000000   22.92738889   11.76592222 
   22.92738889     .00000000   43.32880000 
   11.76592222   43.32880000     .00000000 

3 centroids are 3 points, but SSb is based on N points (propagated centroids):
N1 points representing centroid1, N2 points representing centroid2 and N3 representing centroid3.
Therefore the sum of the distances must be weighted accordingly:
N1*N2*22.92738889 + N1*N3*11.76592222 + N2*N3*43.32880000 = 259.51969998
259.51969998 / N = 43.25328333 = SSb

Moral in words: maximizing SSb is equivalent to maximizing
the weighted sum of pairwise squared distances between the centroids.
(And maximizing SSb corresponds to minimizing SSw, since SSt is constant.)

Best Answer

Related Solutions

Solved – How is the distance formula related to the formula for standard deviation

Clustering – Why Minimizing WCSS Maximizes Distance Between Clusters in K-means

Related Question