I am having trouble understanding the concept of Sum of Squares in the context of distance matrices (Studer et al. 2010).
The Sum of Squares I am familiar with is the classical $SS$ from ANOVA, performed on a contingency table such as
sex   FE   employment   joblessness   school
  1   16            4             0        0
  2    8            3             1        8
From which I can quickly compute the ANOVA Sum of Squares with
dtm <- melt(dt, id.vars = "sex")
dtmcount <- count(dtm, sex, value)
dtmcount %>%
  mutate(grandmean = mean(n)) %>%        # grand mean over all cells
  group_by(sex) %>%
  mutate(groupmean  = mean(n),           # per-group mean
         SumSqTotal = (n - grandmean)^2,
         SSW        = (n - groupmean)^2,
         SSB        = (grandmean - groupmean)^2) %>%
  ungroup() %>%
  summarise(SST = sum(SumSqTotal), SSW = sum(SSW), SSB = sum(SSB))
# results #
#  SST  SSW  SSB
#   53   20   33
The Sum of Squares makes sense to me there because I understand that we are comparing group means and decomposing the variability around them.
However, when it comes to distance matrices, I don't understand what the "mean" becomes.
Consider the same data, but this time we are comparing sequences.
sex Sep.93 Oct.93 Nov.93 Dec.93
1 1 school school school school
2 1 training training training training
3 1 school school school school
4 1 training training training training
5 1 school school school school
6 2 FE FE FE FE
7 2 FE FE FE FE
8 2 training training training training
9 2 school school school school
10 2 employment employment employment employment
In a classical sequence analysis, we would use a distance algorithm to compute pairwise dissimilarities and end up with a distance matrix like this one (from the Hamming distance):
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 4 0 4 0 4 4 4 0 4
[2,] 4 0 4 0 4 4 4 0 4 4
[3,] 0 4 0 4 0 4 4 4 0 4
[4,] 4 0 4 0 4 4 4 0 4 4
[5,] 0 4 0 4 0 4 4 4 0 4
[6,] 4 4 4 4 4 0 0 4 4 4
[7,] 4 4 4 4 4 0 0 4 4 4
[8,] 4 0 4 0 4 4 4 0 4 4
[9,] 0 4 0 4 0 4 4 4 0 4
[10,] 4 4 4 4 4 4 4 4 4 0
The analysis of variance for sequence analysis was developed by Studer et al. (2010).
From the paper (2010), I quote the following:
According to Batagelj (1988), the notion of a gravity center holds for any kind of distances and objects, even though it is not clearly defined for complex nonnumeric objects such as sequences. It is likely that the gravity center does not itself belong to the object space, exactly as the mean of integer values may be a real noninteger value. […] Even though the gravity center may not be observable, equation (4) provides a comprehensive way to compute the most central sequence, the medoid, of a set using weights. Searching the x that minimizes equation (4) is equivalent to minimizing the sum of the weighted distances from x to all other sequences. (p.8).
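To make the quoted idea concrete, here is a small R sketch using the toy data above. Since every toy sequence is constant over its four months, the Hamming distance is just 4 times the indicator that two states differ, and the medoid is the observed sequence minimizing the (weighted) sum of distances to all others:

```r
# Toy data from above: each sequence is constant over its four months,
# so the Hamming distance between two sequences is 4 * (their states differ)
states <- c("school", "training", "school", "training", "school",
            "FE", "FE", "training", "school", "employment")
D <- 4 * outer(states, states, "!=")   # reproduces the 10 x 10 matrix above

# the medoid minimizes the (weighted) sum of distances to all other sequences
w <- rep(1, length(states))            # equal weights here
medoid <- which.min(D %*% w)
states[medoid]                         # "school": the most central sequence
```

With equal weights this reduces to picking the row of the distance matrix with the smallest row sum.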
They developed an R function, dissassoc(), in the TraMineR library. It also makes it possible to get p-values for covariates using permutation tests.
The output looks like this:
Pseudo ANOVA table:
SS df MSE
Exp 1.7 1 1.700000
Res 9.6 8 1.200000
Total 11.3 9 1.255556
However, I fail to completely understand how to compute the Sum of Squares for such a matrix (manually, if possible), for both the Total and the Explanatory (Exp) rows. What is the "mean" or "center" in this context?
Thank you.
Data and code
library(dplyr)
library(reshape2)
library(TraMineR)
# data from TraMineR #
dt = structure(list(sex = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), Sep.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Oct.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Nov.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor"), Dec.93 = structure(c(3L,
4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L), .Label = c("FE", "employment",
"school", "training"), class = "factor")), .Names = c("sex",
"Sep.93", "Oct.93", "Nov.93", "Dec.93"), row.names = c(NA, -10L
), class = "data.frame")
# To transform the data into COUNT data #
dtm = melt(dt, id.vars = c('sex'))
dtmcount = count(dtm, sex, value)
# The `SS` is easily computed with #
dtmcount %>%
  mutate(grandmean = mean(n)) %>%
  group_by(sex) %>%
  mutate(groupmean  = mean(n),
         SumSqTotal = (n - grandmean)^2,
         SSW        = (n - groupmean)^2,
         SSB        = (grandmean - groupmean)^2) %>%
  ungroup() %>%
  summarise(SST = sum(SumSqTotal), SSW = sum(SSW), SSB = sum(SSB))
# Hamming distance function #
Ham = function(d){
mat = matrix(0, nrow(d), nrow(d))
len = nrow(d)
mat = matrix(0, len, len)
for(k in 1:len){
for(i in 1:len){
mat[k,i] = sum( ifelse( as.numeric( d[k, ] == d[i, ] ) == 1, 0 , 1) )
}
}
return(mat)
}
and used on the example data like this
Ham(dt[,-1])
# The TraMineR Example #
data(mvad)
set.seed(10)
mv <- mvad %>% group_by(male) %>% sample_n(5) %>% ungroup()
# build the state sequence object (columns 17:86 hold the monthly
# states, as in the TraMineR examples)
mv.seq <- seqdef(mv[, 17:86])
# compute the dissimilarity matrix
mvad.ham <- seqdist(mv.seq, method = "HAM")
# compute the discrepancy analysis
d <- dissassoc(mvad.ham, group = mv$male, R = 10)
References:
Studer, Matthias, et al. "Discrepancy analysis of state sequences." Sociological Methods & Research 40.3 (2011): 471-510.
Gabadinho, Alexis, et al. "Analyzing and visualizing state sequences in R with TraMineR." Journal of Statistical Software 40.4 (2011): 1-37.
Best Answer
Instructions for how you can compute the sums of squares SSt, SSb, SSw out of a matrix of (Euclidean) distances between cases (data points), without having the cases x variables dataset at hand. You don't need to know the centroids' coordinates (the group means): they pass invisibly "in the background", as the laws of Euclidean geometry allow.
Let $\bf D$ be the N x N matrix of squared Euclidean distances between the cases, and let $G$ be the N x 1 column of group labels (k groups). Create binary dummy variables, aka the N x k design matrix: $\mathbf G=design(G)$.
[I'll accompany the formulas with example data shared with this answer. The code is SPSS Matrix session syntax, which reads almost like pseudocode.]
Quick method. Use this if you want just the three scalars. As mentioned here or here, the sum of squared deviations from a centroid equals the sum of pairwise squared Euclidean distances divided by the number of points. It follows that:
Total sum-of-squares (of deviations from the grand centroid): $SS_t= \frac{\sum \bf D}{2N}$, where $\sum$ denotes the sum over the entire matrix.
Pooled within-group sum-of-squares (of deviations from the group centroids): $SS_w= \sum \frac{diag(\bf G'DG)}{2\bf n'}$, where $\bf n$ is the k-length row vector of within-group frequencies, i.e. the column sums of $\bf G$. Without the summation $\sum$ you have the k-length column vector of the $SS_w$ of each group.
Between-group sum-of-squares is then, of course, $SS_b=SS_t-SS_w$.
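A minimal R sketch of the quick method, applied to the toy Hamming matrix from the question. Note that the identities strictly hold for squared Euclidean distances; here, as in Studer et al., the dissimilarities themselves are plugged into the formulas:

```r
# Toy Hamming matrix from the question: the sequences are constant
# over the four months, so d(i, j) = 4 * (state_i != state_j)
states <- c("school", "training", "school", "training", "school",
            "FE", "FE", "training", "school", "employment")
D <- 4 * outer(states, states, "!=")           # N x N dissimilarity matrix
g <- factor(rep(c(1, 2), each = 5))            # the 'sex' grouping
G <- model.matrix(~ g - 1)                     # N x k dummy (design) matrix
n <- colSums(G)                                # within-group frequencies
N <- nrow(D)

SST <- sum(D) / (2 * N)                        # total sum of squares
SSW <- sum(diag(t(G) %*% D %*% G) / (2 * n))   # pooled within-group SS
SSB <- SST - SSW                               # between-group (explained) SS
c(SST = SST, SSW = SSW, SSB = SSB)             # 14, 12, 2
```

So for these ten sequences the "pseudo ANOVA" decomposition is SST = 14, with 12 residual (within sex) and 2 explained (between sexes).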
Slower method. Use this if you also need multivariate properties of the data, such as the eigenvalues of the principal directions spanned by the data cloud (needed, for example, in multidimensional scaling).
Convert $\bf D$ into its double-centered matrix $\bf S$, explained here and here. As noted in the second link, one of its properties is that $trace(\mathbf S)=trace(\mathbf {T})=SS_t$, and the (first nonzero) eigenvalues of $\bf S$ are the same as those of $\bf T$, the scatter matrix of the original (or implied, hypothetical) cases x variables data.
Of course, you can likewise double-center each group's distance submatrix; the traces of the resulting $\bf S$s are each group's within-group SS, which sum to the pooled $SS_w$.
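The double centering can be sketched in R; simulated coordinates are used here (an assumption, purely so the result can be checked against the raw data):

```r
# Double centering: S = -J D J / 2, where J = I - 11'/N centers rows and columns
set.seed(1)
X <- matrix(rnorm(20), nrow = 10)              # hypothetical 10 cases x 2 variables
D <- as.matrix(dist(X))^2                      # squared Euclidean distances
N <- nrow(D)
J <- diag(N) - matrix(1 / N, N, N)
S <- -J %*% D %*% J / 2

# trace(S) recovers SS_t, the scatter about the grand centroid
Xc  <- scale(X, scale = FALSE)                 # grand-mean centered data
SSt <- sum(Xc^2)
all.equal(sum(diag(S)), SSt)                   # TRUE

# the nonzero eigenvalues of S equal those of the scatter matrix T = Xc'Xc
evS <- eigen(S, symmetric = TRUE, only.values = TRUE)$values[1:2]
evT <- eigen(crossprod(Xc), symmetric = TRUE, only.values = TRUE)$values
all.equal(evS, evT)                            # TRUE
```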
If you need to compute the matrix $\bf C$ of squared Euclidean distances between the group centroids, compute these ingredients:
$\bf E= G'DG$;
$\mathbf Q= \frac{diag(\mathbf E)\,\mathbf n^2}{2}$, where $diag(\bf E)$ is a $k \times 1$ column, $\bf n^2$ is the $1 \times k$ row of squared frequencies, and their product is the outer product;
$\bf F=n'n$.
Then $\bf C= (\mathbf E-\frac{\mathbf{Q+Q'}}{\mathbf F})/\mathbf F$, with the divisions taken elementwise.
Of course, $SS_b$, if you still don't know it, can be obtained from $\bf C$ with each centroid weighted by its group frequency: $SS_b = \frac{\mathbf{n\,C\,n'}}{2N}$.
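A sketch of these ingredients in R, again on simulated data (an assumption) so the result can be checked against distances between explicitly computed centroids:

```r
set.seed(2)
X <- matrix(rnorm(36), nrow = 12)              # hypothetical 12 cases x 3 variables
g <- factor(rep(1:3, each = 4))                # k = 3 groups
D <- as.matrix(dist(X))^2                      # squared Euclidean distances
G <- model.matrix(~ g - 1)                     # N x k design matrix
n <- colSums(G)                                # group frequencies

E  <- t(G) %*% D %*% G
Q  <- (diag(E) / 2) %o% n^2                    # outer product: Q[i,j] = E[i,i] * n[j]^2 / 2
Fm <- n %o% n                                  # Fm[i,j] = n[i] * n[j]
C  <- (E - (Q + t(Q)) / Fm) / Fm               # elementwise divisions

# check against the centroids computed from the raw coordinates
M <- rowsum(X, g) / n                          # k x p matrix of centroids
all.equal(C, as.matrix(dist(M))^2, check.attributes = FALSE)   # TRUE

# SS_between from C, each centroid weighted by its group frequency
SSb <- drop(n %*% C %*% n) / (2 * nrow(X))
```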
Bonus instructions. How to compute SSt, SSb, SSw when you do have the original N cases x p variables data $\bf X$. Many ways are possible; one of the most efficient (fast) matrix ways with data of typical size is as follows.
Matrix of group means (centroids), k groups x p variables: $\bf M= \frac{G'X}{n'[1]}$, where $[1]$ is the $p$-length row of ones, $\bf n$ is the row vector defined earlier, $\bf G$ is as above, and $\bf X$ is the data with its columns (variables) centered about their grand means.
Total scatter matrix $\bf T=X'X$, and $SS_t= trace(\mathbf T)$.
Between-group scatter matrix $\bf B=(GM)'(GM)$, and $SS_b= trace(\mathbf B)$.
Pooled within-group scatter matrix $\bf W=T-B$, and $SS_w= trace(\mathbf W)= SS_t-SS_b$.
(If you do not center $\bf X$ initially, $\bf W=T-B$ still holds and gives the same $\bf W$ as before; however, the $\bf M$, $\bf T$, and $\bf B$ matrices will differ from before.)
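Finally, the bonus computation sketched in R on simulated data (an assumption), cross-checked against the distance-only quick method; crossprod(A) computes A'A:

```r
set.seed(3)
X <- matrix(rnorm(30), nrow = 10)              # hypothetical N = 10 cases, p = 3 variables
g <- factor(rep(1:2, each = 5))
G <- model.matrix(~ g - 1)                     # N x k design matrix
n <- colSums(G)

Xc <- scale(X, scale = FALSE)                  # center variables about grand means
M  <- (t(G) %*% Xc) / n                        # k x p matrix of group means

Tmat <- crossprod(Xc)                          # total scatter T = X'X
Bmat <- crossprod(G %*% M)                     # between-group scatter B = (GM)'(GM)
Wmat <- Tmat - Bmat                            # pooled within-group scatter W

SSt <- sum(diag(Tmat))
SSb <- sum(diag(Bmat))
SSw <- sum(diag(Wmat))

# the same numbers from the distance matrix alone (the quick method)
D <- as.matrix(dist(X))^2
all.equal(SSt, sum(D) / (2 * nrow(D)))                        # TRUE
all.equal(SSw, sum(diag(t(G) %*% D %*% G) / (2 * n)))         # TRUE
```

This makes the answer's point concrete: the coordinates only enter through the pairwise distances, so the distance-only and raw-data routes give identical sums of squares.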