Solved – Clustering: Variance or Squared deviation from the mean

clustering, hierarchical clustering, pca, variance

I'm studying performance analysis, and clustering is one of the methods covered for summarizing data. I understand what clustering is and the various algorithms for clustering data. However, my book (The Art of Computer Systems Performance Analysis, Raj Jain) says the following about clustering techniques:
[Image: excerpt from the book on clustering techniques, with the variance formula underlined in red]
What I really don't understand is the variance formula underlined in red. How can we talk about intragroup and intergroup variance in clustering? My professor said that this formula is correct only for the SDM (squared deviation from the mean), not for variances, because you can't divide each cluster's sum by n-1 when the clusters do not each contain n elements. So we should talk about intragroup SDM and intergroup SDM.

Furthermore, suppose I run a PCA before clustering and keep just 2 principal components that explain 93% of the variance (here it is clear we can talk about variances). Then I cluster with Ward's method, and the dendrogram shows me the intragroup distance for the number of clusters I choose. How do I estimate the total variance lost after PCA + clustering? How can it be done if I have to estimate SDM rather than variance after clustering? And why does the formula in my book refer to variances? Thanks.
P.S. I used the JMP tool.

Best Answer

Forget about PCA for now; you don't need it here.

k-means clustering and Ward's method are both based on the idea of minimizing the sum of squares, and that value is related to variance.

There are different kinds of variance. Here you need the simple one, even though it is the biased estimator; for large data sets the difference doesn't matter. To understand the book's statement, take the variance of a set S of N points with mean \mu_S to be simply $$\frac{1}{N} \sum_{x_i\in S} \|x_i - \mu_S\|^2$$ that is, divide by N and not by N-1.
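To make the N versus N-1 distinction concrete, here is a minimal sketch using NumPy. The sample values are made up for illustration, and the data is 1-D for simplicity (the formula above applies to vectors in the same way).

```python
import numpy as np

# Made-up 1-D sample, purely for illustration.
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
n = len(x)

# Sum of squared deviations from the mean (the "SDM" in the question).
ssq = np.sum((x - x.mean()) ** 2)

var_biased = ssq / n          # divide by N, same as np.var(x, ddof=0)
var_unbiased = ssq / (n - 1)  # divide by N-1, same as np.var(x, ddof=1)

print(ssq, var_biased, var_unbiased)
```

The two estimates differ only by the factor (N-1)/N, which approaches 1 as N grows.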

But when looking at clusters, you want a cluster with many objects to have more influence. So for a cluster of N objects, you take N times its variance, i.e., simply the sum of squares.
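Written out, this step looks as follows. The notation is mine, not the book's: C_k is the k-th cluster, N_k its number of members, and \mu_k its centroid.

$$N_k \cdot \frac{1}{N_k} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 = \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 = \text{SSQ}_k$$

The factor N_k cancels, so the size-weighted variance of a cluster is exactly its sum of squares, and no division by N_k or N_k - 1 remains.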

The proper version of the equation, written with sums of squares (SSQ), is

total-SSQ = within-cluster-SSQ + between-clusters-SSQ = constant

Since the total is fixed for a given data set, it says that minimizing the within-cluster SSQ (or, equivalently, the size-weighted variance of the clusters) increases the separation between clusters, and conversely.
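The decomposition can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up toy data set. The helper `ssq_decomposition` is a hypothetical name introduced here, and the between-clusters term is taken as each centroid's squared distance to the grand mean, weighted by cluster size.

```python
import numpy as np

def ssq_decomposition(X, labels):
    """Return (total, within, between) sums of squares for labelled data."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total = np.sum((X - grand_mean) ** 2)
    within = 0.0
    between = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        # Squared deviations of the members from their own centroid.
        within += np.sum((members - centroid) ** 2)
        # Squared deviation of the centroid from the grand mean,
        # weighted by the number of members in the cluster.
        between += len(members) * np.sum((centroid - grand_mean) ** 2)
    return total, within, between

# Made-up toy data: six 2-D points forming two visible groups.
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
              [8.0, 9.0], [9.0, 8.0], [8.5, 8.5]])

good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the visible groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # arbitrary mixed assignment

for labels in (good_labels, bad_labels):
    total, within, between = ssq_decomposition(X, labels)
    print(total, within, between, np.isclose(total, within + between))
```

The total is the same for both assignments; only its split into within and between changes. The assignment that matches the visible groups has the smaller within-cluster SSQ and therefore the larger between-clusters SSQ.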