Solved – Comparing four formulations of class scatter matrices

covariance-matrixdiscriminant analysisunbalanced-classes

I am trying to decide between / reconcile four formulations for class scatter matrices.

The first from Duda et al. (2012), p.544 has (with symbols modified):

$$m_i = \frac{1}{n_i} \sum_{x\in \mathcal{D}_i}x$$

$$m = \frac{1}{n}\sum_{\mathcal{D}}x$$

$$S_W = \sum_{i=1}^C \sum_{x \in\mathcal{D}_i}(x -m_i)(x-m_i)^T$$

$$S_B = \sum_{i=1}^Cn_i(m_i – m)(m_i – m)^T$$

$$S_T = \sum_{x \in \mathcal{D}}(x-m)(x-m)^T$$

where $\mathcal{D}_i$ is the $i$th class, $m_i$ is the class mean, $m$ the overall mean, $C$ the number of classes, $n_i$ the number of items in class $\mathcal{D}_i$. $\mathcal{D}$ is the set of all classes.

The second formulation, from this course page, which is equivalent to the one in Webb (2002), p.375:

$$P_i = n_i / n$$

$$S_W = \frac{1}{n} \sum_{i=1}^C \sum_{x \in\mathcal{D}_i}(x -m_i)(x-m_i)^T$$

$$S_B = \sum_{i=1}^{C} P_i (m_i – m)(m_i – m)^T$$

$$S_T = \frac{1}{n} \sum_{x \in \mathcal{D}}(x-m)(x-m)^T$$

where $P_i$ is the a priori probability for class $\mathcal{D}_i$.

A third formulation is from Johnson and Wichern (2007), p.623:

$$\bar{m} = \frac{1}{C}\sum_{i=1}^C m_i$$

$$S_W = \sum_{i=1}^C \sum_{x \in\mathcal{D}_i}(x -m_i)(x-m_i)^T$$

$$S_B = \sum_{i=1}^{C} (m_i – \bar{m})(m_i – \bar{m})^T$$

$$S_T = S_B + S_W$$

where $\bar{m}$ is the mean of class means. $S_T$ is not given directly.

And finally, a fourth formulation, proposed in this post:

$$\bar{n} = \frac{\sum n_i}{C}$$

$$S_W = \bar{n} \sum_{i=1}^C \frac{1}{n_i} \sum_{x \in\mathcal{D}_i}(x -m_i)(x-m_i)^T$$

$$S_B = \bar{n} \sum_{i=1}^{C} (m_i – \bar{m})(m_i – \bar{m})^T$$

$$S_T = \bar{n} \sum_{i=1}^{C} \frac{1}{n_i} \sum_{x \in \mathcal{D}}(x-\bar{m})(x-\bar{m})^T$$

where $\bar{n}$ is the mean number of points per class, and $\bar{m}$ is the mean of class means.

As far as I can tell, the main difference is in the choice of covariance matrix. It looks to me like Duda is based on scatter matrices, Webb on the maximum likelihood estimate of the covariance matrix and Johnson on sample scatter matrices.

Can anyone detail the difference/effects? What is the use case for each? Has anyone seen formulation 4 in a more authoritative source?

Best Answer

Let's go over your four definitions one by one.

Duda et al. 2012. These are the standard definitions of scatter matrices: within-class, between-class, and the total scatter matrix. They obey a nice and useful property $$S_W+S_B=S_T,$$ so one can talk about the "decomposition of the scatter matrix" similar to the "decomposition of the sum of squares" in a univariate situation (one-way ANOVA). For the purposes of linear discriminant analysis (LDA), one only needs the product $S_W^{-1}S_B$.

Scatter matrix differs from covariance matrix only by a scalar multiplier: sample covariance matrix is equal to the scatter matrix divided by $n$ (for maximum likelihood estimate) or by $n-1$ (for unbiased estimate).
Webb 2002. These definitions differ from (1) only by the $1/n$ factor; otherwise they are identical. It follows that the product $S_W^{-1}S_B$ computed using these definitions will be identical to (1) and so the definitions (1) and (2) are equivalent as far as LDA is concerned.

Of course Webb's $S_T$ is just the sample covariance matrix (ML estimate), so one might think that these definitions simply replace scatter matrices with covariances matrices. But the situation is tricky here because between-class covariance matrix is usually estimated with $C-1$ denominator (instead of $n$) and within-class covariance matrix with $n-C$ denominator: these are the respective degrees of freedom. If one uses these factors then the decomposition of total covariance matrix into between-class and within-class covariance matrices does not hold (and using the same factor does not make much sense). This is why it is easier to work with scatter matrices instead of covariance matrices and to side-step these problems.

The reason Webb 2002 uses $1/n$ factor is probably so that his $S_T$ was equal to the total covariance matrix, which is a very familiar object. However, if Webb uses $1/n$ factor and still calls it "scatter matrices" then it is a very non-standard terminology.
Johnson and Wichern 2007. This is a non-standard definition of the between-class scatter matrix (within-class one is the same here as in (1)) and the authors do not seem to motivate them in their textbook. So I can only guess at what is the rationale behind it, see my answer to What is the correct formula for between-class scatter matrix in LDA?. As I wrote there, this approach can actually be useful when the classes are unbalanced (different number of data points per class). One can call this a "re-balanced between-class scatter matrix".
@amoeba 2015. The between-class scatter matrix from (3) does not turn into the standard between-class scatter matrix from (1) when all $n_i$ are equal to each other. There is a scalar factor $\bar n$ missing.

Another problem with (3) is that there is no meaningful definition of total scatter matrix preserving the decomposition equation $S_W+S_B=S_T$. Definitions (4) were my attempt to suggest a set of definitions for re-balanced scatter matrices so that they (i) preserve the decomposition property and (ii) reduce to the standard definitions (1) when classes are balanced.

The idea is that maybe you have unequal $n_i$ due to some experimental limitations, but you would still like to have a guess at what would happen if the $n$'s were equal (perhaps you expect them to be equal in the test dataset or in the future). So even for within-class covariance matrix you want to weigh the contribution of each class equally.

No, I have never seen it described in the literature.

Best Answer

Related Solutions

Discriminant Analysis – Deriving Total (Within Class + Between Class) Scatter Matrix

Python’s scikit-learn LDA Issues – How It Computes LDA via SVD

Related Question