Solved – Confusion about scatter matrices

clusteringmatrix

I am learning about evaluating clustering outcome and am confused about the scatter matrices. Hoping to get some help here.

The within-cluster scatter matrix $S_W$is defined as:
$$
S_W=\sum _{ k=1 }^{ K }{ \sum _{ x\in { C }_{ k } }^{ }{ \left( x-{ \mu }_{ k } \right) { \left( x-{ \mu }_{ k } \right) }^{ T } } }
$$
The between-cluster matrix $S_B$ is defined as:
$$
S_B=\sum _{ k=1 }^{ K }{ N_k }{ \left( { \mu }_{ k }-{ \mu } \right) { \left( { \mu }_{ k }-{ \mu } \right) }^{ T } }
$$
where $K$ is number of clusters, $x$ is a member in cluster $C_k$, $\mu_k$ is centroids of cluster $C_k$, $N_k$ is number of members in cluster $C_k$, $\mu$ is the mean of the whole dataset.

My dataset has the form of $m$ by $d$, i.e. $m$ data points of $d$ dimensions (features). After clustering, each cluster $C_k$ has the form of $N_k$ by $d$. So, a point $x$ has the dimension of $1\times d$, likewise $\mu_k$ has the dimension of $1\times d$. And, $S_W$ and $S_B$ are scalar. Why are they not matrices? What exactly should I expect in the elements of the matrices? With that result, the use of $trace \left( S_W \right)$ and $trace \left( S_B \right)$ become irrelevant.

I sure has not understood the subject correctly. I will appreciate any help here.

Update 1:

The scatter matrix for each cluster is given as:
$$
S_k=\sum _{ x\in { C }_{ k } }^{ }{ \left( x-{ \mu }_{ k } \right) { \left( x-{ \mu }_{ k } \right) }^{ T } }
$$
which gives me a scalar quantity for $x$ and $\mu_k$ of size $1 \times d$. What should $S_{ k(i,j) }$ be? Given a dataset of $n$ observations with $d$ variables/features (i.e. $n \times d$), what should be the size of $S_k$?

To further clarify my problem, suppose one of the clusters has 2 members (rows) with 3 variables/features (columns):
$$
C_k = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
$$
Then
$$
{ x }_{ 1 }=\left[ \begin{matrix} 1 & 2 & 3 \end{matrix} \right] \\ { x }_{ 2 }=\left[ \begin{matrix} 4 & 5 & 6 \end{matrix} \right] \\ { \mu }_{ k }=mean\left( { x }_{ 1 },{ x }_{ 2 } \right) =\left[ \begin{matrix} 2.5 & 3.5 & 4.5 \end{matrix} \right] \\ \left( { x }_{ 1 }-{ \mu }_{ k } \right) { \left( { x }_{ 1 }-{ \mu }_{ k } \right) }^{ T }=\left[ \begin{matrix} -1.5 & -1.5 & -1.5 \end{matrix} \right] \left[ \begin{matrix} -1.5 \\ -1.5 \\ -1.5 \end{matrix} \right] =6.75\\ \left( { x }_{ 2 }-{ \mu }_{ k } \right) { \left( { x }_{ 2 }-{ \mu }_{ k } \right) }^{ T }=\left[ \begin{matrix} 1.5 & 1.5 & 1.5 \end{matrix} \right] \left[ \begin{matrix} 1.5 \\ 1.5 \\ 1.5 \end{matrix} \right] =6.75\\ { S }_{ k }=\sum _{ x\in { C }_{ k } }^{ }{ \left( x-{ \mu }_{ k } \right) { \left( x-{ \mu }_{ k } \right) }^{ T } } =6.75+6.75=13.5
$$
So, $S_k$ is scalar and consequently $S_W$ will be scalar. Where did I go wrong in the above computation of scatter matrix for one cluster?

Update 2:

So, the data points should be column vectors.
$$
{ x }_{ 1 }=\left[ \begin{matrix} 1 \\ 2 \\ 3 \end{matrix} \right] \\ { x }_{ 2 }=\left[ \begin{matrix} 4 \\ 5 \\ 6 \end{matrix} \right] \\ { \mu }_{ k }=mean\left( { x }_{ 1 },{ x }_{ 2 } \right) =\left[ \begin{matrix} 2.5 \\ 3.5 \\ 4.5 \end{matrix} \right] \\ \left( { x }_{ 1 }-{ \mu }_{ k } \right) { \left( { x }_{ 1 }-{ \mu }_{ k } \right) }^{ T }=\left[ \begin{matrix} -1.5 \\ -1.5 \\ -1.5 \end{matrix} \right] \left[ \begin{matrix} -1.5 & -1.5 & -1.5 \end{matrix} \right] =\left[ \begin{matrix} 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \end{matrix} \right]\\ \left( { x }_{ 2 }-{ \mu }_{ k } \right) { \left( { x }_{ 2 }-{ \mu }_{ k } \right) }^{ T }=\left[ \begin{matrix} 1.5 \\ 1.5 \\ 1.5 \end{matrix} \right] \left[ \begin{matrix} 1.5 & 1.5 & 1.5 \end{matrix} \right] =\left[ \begin{matrix} 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \end{matrix} \right]\\ { S }_{ k }=\sum _{ x\in { C }_{ k } }^{ }{ \left( x-{ \mu }_{ k } \right) { \left( x-{ \mu }_{ k } \right) }^{ T } } =\left[ \begin{matrix} 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \end{matrix} \right]+\left[ \begin{matrix} 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \\ 2.25 & 2.25 & 2.25 \end{matrix} \right]=\left[ \begin{matrix} 4.5 & 4.5 & 4.5 \\ 4.5 & 4.5 & 4.5 \\ 4.5 & 4.5 & 4.5 \end{matrix} \right]
$$
And finally, $S_W=\sum _{ k=1 }^{ K } S_k$. Yes, this is matrix! Do I get it right this time?

The size of $S_W$ is $d \times d$ ($d$ being dimension/no. of features of the data points). $trace(S_W)$ is then the sum-of-squared-error.

Update 3:

Using the approach given by @ttnphns, the data matrix for cluster k can be arranged in rows (while the equations above have data in columns):
$$
X_k = \begin{bmatrix} x_1^T \\ x_2^T \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
$$

Each columns has 2 elements, hence to center the matrix columns, the required center matrix is
$$
C_2 = I(2) – \frac{1}{n}O(2) = \begin{bmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{bmatrix}
$$
where $I(2)$ is identify matrix of size 2, $O(2)$ is 2-by-2 matrix of all 1's.

Center columns of matrix $X_k$,
$$
X_k^c = C_2X_k = \begin{bmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{bmatrix}\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\\=\begin{bmatrix} -1.5 & -1.5 & -1.5 \\ 1.5 & 1.5 & 1.5 \end{bmatrix}
$$
Do $X_k^{c~T}X_k^c$ gives $S_k$
$$
S_k=X_k^{c~T}X_k^c = \begin{bmatrix} -1.5 & 1.5 \\ -1.5 & 1.5 \\ -1.5 & 1.5 \end{bmatrix}\begin{bmatrix} -1.5 & -1.5 & -1.5 \\ 1.5 & 1.5 & 1.5 \end{bmatrix}\\=\begin{bmatrix} 4.5 & 4.5 & 4.5 \\ 4.5 & 4.5 & 4.5 \\ 4.5 & 4.5 & 4.5 \end{bmatrix}
$$
This is the same as in Update 2.

Best Answer

What @whuber and I were saying to you in the comments are equivalent things. @whuber pointed out that the text you were reading makes points column vectors. I stuck to your own original notation where points are row vectors (this way of presenting is more common). When points are columns, thansposed ("T", or just ' in my notation) multiplier is the right one; when they are rows, it is the left one. Instead of multiplying separate vectors, it's more convenient to multiply whole matrices. See it with your data (matrix A = your "Ck"):

****** Points are rows, variables are columns [more common] ******
A
  1  2  3
  4  5  6

Column-centered A
  -1.500000000  -1.500000000  -1.500000000
   1.500000000   1.500000000   1.500000000

A'A, the scatter matrix
   4.500000000   4.500000000   4.500000000
   4.500000000   4.500000000   4.500000000
   4.500000000   4.500000000   4.500000000

****** Points are columns, variables are rows [that's how in your book] ******    
A
  1  4
  2  5
  3  6

Row-centered A
  -1.500000000   1.500000000
  -1.500000000   1.500000000
  -1.500000000   1.500000000

AA', the scatter matrix
   4.500000000   4.500000000   4.500000000
   4.500000000   4.500000000   4.500000000
   4.500000000   4.500000000   4.500000000