I am having trouble understanding the concept of Sum of Squares in the context of distance matrices (Studer et al. 2010).
The Sum of Squares I am familiar with is the classical $SS$ from ANOVA, performed on a contingency table such as
sex   FE   employment   joblessness   school
  1   16            4             0        0
  2    8            3             1        8
From which I can quickly compute the ANOVA Sum of Squares with
dtm <- melt(dt, id.vars = "sex")
dtmcount <- count(dtm, sex, value)
dtmcount %>%
  mutate(grandmean = mean(n)) %>%        # grand mean over all cells
  group_by(sex) %>%
  mutate(groupmean  = mean(n),           # per-group mean
         SumSqTotal = (n - grandmean)^2,
         SSW        = (n - groupmean)^2,
         SSB        = (grandmean - groupmean)^2) %>%
  ungroup() %>%
  summarise(SST = sum(SumSqTotal), SSW = sum(SSW), SSB = sum(SSB))
# results #
#  SST  SSW  SSB
#   53   20   33
The Sum of Squares makes sense to me there because I understand that we are comparing group means and decomposing the variability around them.
However, when it comes to distance matrices, I don't understand what the "mean" becomes.
Consider the same data, but this time we are comparing sequences.
sex Sep.93 Oct.93 Nov.93 Dec.93
1 1 school school school school
2 1 training training training training
3 1 school school school school
4 1 training training training training
5 1 school school school school
6 2 FE FE FE FE
7 2 FE FE FE FE
8 2 training training training training
9 2 school school school school
10 2 employment employment employment employment
In a classical sequence analysis, we would use a distance algorithm to compute pairwise dissimilarities and end up with a distance matrix like this one (from the Hamming distance):
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 4 0 4 0 4 4 4 0 4
[2,] 4 0 4 0 4 4 4 0 4 4
[3,] 0 4 0 4 0 4 4 4 0 4
[4,] 4 0 4 0 4 4 4 0 4 4
[5,] 0 4 0 4 0 4 4 4 0 4
[6,] 4 4 4 4 4 0 0 4 4 4
[7,] 4 4 4 4 4 0 0 4 4 4
[8,] 4 0 4 0 4 4 4 0 4 4
[9,] 0 4 0 4 0 4 4 4 0 4
[10,] 4 4 4 4 4 4 4 4 4 0
The analysis of variance for sequence analysis was developed by Studer et al. (2010).
From the paper (2010), I quote the following:
According to Batagelj (1988), the notion of a gravity center holds for any kind of distances and objects, even though it is not clearly defined for complex nonnumeric objects such as sequences. It is likely that the gravity center does not itself belong to the object space, exactly as the mean of integer values may be a real noninteger value. […] Even though the gravity center may not be observable, equation (4) provides a comprehensive way to compute the most central sequence, the medoid, of a set using weights. Searching the x that minimizes equation (4) is equivalent to minimizing the sum of the weighted distances from x to all other sequences. (p.8).
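To make the quoted idea concrete, here is a small R sketch using the toy data above. Since every toy sequence is constant over its four months, the Hamming distance is just 4 times the indicator that two states differ, and the medoid is the observed sequence minimizing the (weighted) sum of distances to all others:

```r
# Toy data from above: each sequence is constant over its four months,
# so the Hamming distance between two sequences is 4 * (their states differ)
states <- c("school", "training", "school", "training", "school",
            "FE", "FE", "training", "school", "employment")
D <- 4 * outer(states, states, "!=")   # reproduces the 10 x 10 matrix above

# the medoid minimizes the (weighted) sum of distances to all other sequences
w <- rep(1, length(states))            # equal weights here
medoid <- which.min(D %*% w)
states[medoid]                         # "school": the most central sequence
```

With equal weights this reduces to picking the row of the distance matrix with the smallest row sum.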
They developed an R function, dissassoc(), in the TraMineR library. It also makes it possible to get p-values for covariates using permutation tests.
The output looks like this:
Pseudo ANOVA table:
SS df MSE
Exp 1.7 1 1.700000
Res 9.6 8 1.200000
Total 11.3 9 1.255556
However, I fail to completely understand how to compute the Sum of Squares for such a matrix (manually, if possible), for both the Total and the Explanatory (Exp) rows. What is the "mean" or "center" in this context?
Thank you.
Data and code
library(dplyr)
library(reshape2)
library(TraMineR)
# data from TraMineR #
dt = structure(list(sex = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), Sep.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Oct.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Nov.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Dec.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor")), .Names = c("sex",
"Sep.93", "Oct.93", "Nov.93", "Dec.93"), row.names = c(NA, -10L
), class = "data.frame")
# To transform the data into COUNT data #
dtm = melt(dt, id.vars = c('sex'))
dtmcount = count(dtm, sex, value)
# The `SS` is easily computed with #
dtmcount %>%
  mutate(grandmean = mean(n)) %>%
  group_by(sex) %>%
  mutate(groupmean  = mean(n),
         SumSqTotal = (n - grandmean)^2,
         SSW        = (n - groupmean)^2,
         SSB        = (grandmean - groupmean)^2) %>%
  ungroup() %>%
  summarise(SST = sum(SumSqTotal), SSW = sum(SSW), SSB = sum(SSB))
# Hamming distance function #
Ham = function(d){
mat = matrix(0, nrow(d), nrow(d))
len = nrow(d)
mat = matrix(0, len, len)
for(k in 1:len){
for(i in 1:len){
mat[k,i] = sum( ifelse( as.numeric( d[k, ] == d[i, ] ) == 1, 0 , 1) )
}
}
return(mat)
}
and used on the example data like this
Ham(dt[,-1])
# The TraMineR Example #
data(mvad)
set.seed(10)
mv <- mvad %>% group_by(male) %>% sample_n(5) %>% ungroup()
# build the state sequence object (columns 17:86 hold the monthly
# states, as in the TraMineR examples)
mv.seq <- seqdef(mv[, 17:86])
# compute the dissimilarity matrix
mvad.ham <- seqdist(mv.seq, method = "HAM")
# compute the discrepancy analysis
d <- dissassoc(mvad.ham, group = mv$male, R = 10)
References:
Studer, Matthias, et al. "Discrepancy analysis of state sequences." Sociological Methods & Research 40.3 (2011): 471-510.
Gabadinho, Alexis, et al. "Analyzing and visualizing state sequences in R with TraMineR." Journal of Statistical Software 40.4 (2011): 1-37.
Best Answer
Instructions for how you can compute the sums of squares SSt, SSb, SSw out of a matrix of (Euclidean) distances between cases (data points), without having the cases x variables dataset at hand. You don't need to know the centroids' coordinates (the group means): they pass invisibly "in the background", as the laws of Euclidean geometry allow.
Let $\bf D$ be the N x N matrix of squared Euclidean distances between the cases, and let $G$ be the N x 1 column of group labels (k groups). Create binary dummy variables, aka the N x k design matrix: $\mathbf G=design(G)$.
[I'll accompany the formulas with example data shared with this answer. The code is SPSS Matrix session syntax, which reads almost like pseudocode.]
Quick method. Use this if you want just the three scalars. As mentioned here or here, the sum of squared deviations from a centroid equals the sum of pairwise squared Euclidean distances divided by the number of points. It follows that:
Total sum-of-squares (of deviations from the grand centroid): $SS_t= \frac{\sum \bf D}{2N}$, where $\sum$ denotes the sum over the entire matrix.
Pooled within-group sum-of-squares (of deviations from the group centroids): $SS_w= \sum \frac{diag(\bf G'DG)}{2\bf n'}$, where $\bf n$ is the k-length row vector of within-group frequencies, i.e. the column sums of $\bf G$. Without the summation $\sum$ you have the k-length column vector of the $SS_w$ of each group.
Between-group sum-of-squares is then, of course, $SS_b=SS_t-SS_w$.
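A minimal R sketch of the quick method, applied to the toy Hamming matrix from the question. Note that the identities strictly hold for squared Euclidean distances; here, as in Studer et al., the dissimilarities themselves are plugged into the formulas:

```r
# Toy Hamming matrix from the question: the sequences are constant
# over the four months, so d(i, j) = 4 * (state_i != state_j)
states <- c("school", "training", "school", "training", "school",
            "FE", "FE", "training", "school", "employment")
D <- 4 * outer(states, states, "!=")           # N x N dissimilarity matrix
g <- factor(rep(c(1, 2), each = 5))            # the 'sex' grouping
G <- model.matrix(~ g - 1)                     # N x k dummy (design) matrix
n <- colSums(G)                                # within-group frequencies
N <- nrow(D)

SST <- sum(D) / (2 * N)                        # total sum of squares
SSW <- sum(diag(t(G) %*% D %*% G) / (2 * n))   # pooled within-group SS
SSB <- SST - SSW                               # between-group (explained) SS
c(SST = SST, SSW = SSW, SSB = SSB)             # 14, 12, 2
```

So for these ten sequences the "pseudo ANOVA" decomposition is SST = 14, with 12 residual (within sex) and 2 explained (between sexes).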
Slower method. Use this if you also need multivariate properties of the data, such as the eigenvalues of the principal directions spanned by the data cloud (needed, for example, in multidimensional scaling).
Convert $\bf D$ into its double-centered matrix $\bf S$, explained here and here. As noted in the second link, one of its properties is that $trace(\mathbf S)=trace(\mathbf {T})=SS_t$, and the (first nonzero) eigenvalues of $\bf S$ are the same as those of $\bf T$, the scatter matrix of the original (or implied, hypothetical) cases x variables data.
Of course, you can likewise double-center each group's distance submatrix; the traces of the resulting $\bf S$s are each group's within-group SS, which sum to the pooled $SS_w$.
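The double centering can be sketched in R; simulated coordinates are used here (an assumption, purely so the result can be checked against the raw data):

```r
# Double centering: S = -J D J / 2, where J = I - 11'/N centers rows and columns
set.seed(1)
X <- matrix(rnorm(20), nrow = 10)              # hypothetical 10 cases x 2 variables
D <- as.matrix(dist(X))^2                      # squared Euclidean distances
N <- nrow(D)
J <- diag(N) - matrix(1 / N, N, N)
S <- -J %*% D %*% J / 2

# trace(S) recovers SS_t, the scatter about the grand centroid
Xc  <- scale(X, scale = FALSE)                 # grand-mean centered data
SSt <- sum(Xc^2)
all.equal(sum(diag(S)), SSt)                   # TRUE

# the nonzero eigenvalues of S equal those of the scatter matrix T = Xc'Xc
evS <- eigen(S, symmetric = TRUE, only.values = TRUE)$values[1:2]
evT <- eigen(crossprod(Xc), symmetric = TRUE, only.values = TRUE)$values
all.equal(evS, evT)                            # TRUE
```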
If you need to compute the matrix $\bf C$ of squared Euclidean distances between the group centroids, compute these ingredients:
$\bf E= G'DG$;
$\mathbf Q= \frac{diag(\mathbf E)\,\mathbf n^2}{2}$, where $diag(\bf E)$ is a $k \times 1$ column, $\bf n^2$ is the $1 \times k$ row of squared frequencies, and their product is the outer product;
$\bf F=n'n$.
Then $\bf C= (\mathbf E-\frac{\mathbf{Q+Q'}}{\mathbf F})/\mathbf F$, with the divisions taken elementwise.
Of course, $SS_b$, if you still don't know it, can be obtained from $\bf C$ with each centroid weighted by its group frequency: $SS_b = \frac{\mathbf{n\,C\,n'}}{2N}$.
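A sketch of these ingredients in R, again on simulated data (an assumption) so the result can be checked against distances between explicitly computed centroids:

```r
set.seed(2)
X <- matrix(rnorm(36), nrow = 12)              # hypothetical 12 cases x 3 variables
g <- factor(rep(1:3, each = 4))                # k = 3 groups
D <- as.matrix(dist(X))^2                      # squared Euclidean distances
G <- model.matrix(~ g - 1)                     # N x k design matrix
n <- colSums(G)                                # group frequencies

E  <- t(G) %*% D %*% G
Q  <- (diag(E) / 2) %o% n^2                    # outer product: Q[i,j] = E[i,i] * n[j]^2 / 2
Fm <- n %o% n                                  # Fm[i,j] = n[i] * n[j]
C  <- (E - (Q + t(Q)) / Fm) / Fm               # elementwise divisions

# check against the centroids computed from the raw coordinates
M <- rowsum(X, g) / n                          # k x p matrix of centroids
all.equal(C, as.matrix(dist(M))^2, check.attributes = FALSE)   # TRUE

# SS_between from C, each centroid weighted by its group frequency
SSb <- drop(n %*% C %*% n) / (2 * nrow(X))
```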
Bonus instructions. How to compute SSt, SSb, SSw when you do have the original N cases x p variables data $\bf X$. Many ways are possible; one of the most efficient (fast) matrix ways with data of typical size is as follows.
Matrix of group means (centroids), k groups x p variables: $\bf M= \frac{G'X}{n'[1]}$, where $[1]$ is the $p$-length row of ones, $\bf n$ is the row vector defined earlier, $\bf G$ is as above, and $\bf X$ is the data with its columns (variables) centered about their grand means.
Total scatter matrix $\bf T=X'X$, and $SS_t= trace(\mathbf T)$.
Between-group scatter matrix $\bf B=(GM)'(GM)$, and $SS_b= trace(\mathbf B)$.
Pooled within-group scatter matrix $\bf W=T-B$, and $SS_w= trace(\mathbf W)= SS_t-SS_b$.
(If you do not center $\bf X$ initially, $\bf W=T-B$ still holds and gives the same $\bf W$ as before; however, the $\bf M$, $\bf T$, and $\bf B$ matrices will differ from before.)
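Finally, the bonus computation sketched in R on simulated data (an assumption), cross-checked against the distance-only quick method; crossprod(A) computes A'A:

```r
set.seed(3)
X <- matrix(rnorm(30), nrow = 10)              # hypothetical N = 10 cases, p = 3 variables
g <- factor(rep(1:2, each = 5))
G <- model.matrix(~ g - 1)                     # N x k design matrix
n <- colSums(G)

Xc <- scale(X, scale = FALSE)                  # center variables about grand means
M  <- (t(G) %*% Xc) / n                        # k x p matrix of group means

Tmat <- crossprod(Xc)                          # total scatter T = X'X
Bmat <- crossprod(G %*% M)                     # between-group scatter B = (GM)'(GM)
Wmat <- Tmat - Bmat                            # pooled within-group scatter W

SSt <- sum(diag(Tmat))
SSb <- sum(diag(Bmat))
SSw <- sum(diag(Wmat))

# the same numbers from the distance matrix alone (the quick method)
D <- as.matrix(dist(X))^2
all.equal(SSt, sum(D) / (2 * nrow(D)))                        # TRUE
all.equal(SSw, sum(diag(t(G) %*% D %*% G) / (2 * n)))         # TRUE
```

This makes the answer's point concrete: the coordinates only enter through the pairwise distances, so the distance-only and raw-data routes give identical sums of squares.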