Linear Discriminant Analysis – Correct Formula for the Between-Class Scatter Matrix

covariance-matrix, discriminant-analysis, unbalanced-classes

At one point in applying linear discriminant analysis (LDA), one has to find the vector $v$ that maximizes the ratio $vBv^{T}/vWv^{T}$, where $B$ is the "between-class scatter" matrix and $W$ is the "within-class scatter" matrix.
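(Concretely, this maximization amounts to the generalized eigenvalue problem $Bv=\lambda Wv$; below is a minimal NumPy/SciPy sketch, assuming $B$ and $W$ are already available as symmetric arrays with $W$ positive definite — the function name is mine.)

```python
import numpy as np
from scipy.linalg import eigh

def lda_direction(B, W):
    """Return the vector v that maximizes v B v^T / v W v^T.

    B, W : (d, d) symmetric scatter matrices; W is assumed positive definite.
    """
    # Solve the generalized symmetric eigenproblem  B v = lambda W v;
    # the eigenvector with the largest eigenvalue maximizes the ratio.
    eigvals, eigvecs = eigh(B, W)
    return eigvecs[:, np.argmax(eigvals)]
```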

We are given $k$ classes; class $i$ consists of $N_{i}$ vectors $\mathbf{x}_{ij}$ ($i=1,\dots,k$; $j=1,\dots,N_{i}$). The class sample means are $\mathbf{\bar{x}}_{i}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\mathbf{x}_{ij}$.

All sources I have looked at define $W$ as follows:
$$W = \sum_{i=1}^{k}\sum_{j=1}^{N_{i}}(\mathbf{x}_{ij}-\mathbf{\bar{x}}_{i})(\mathbf{x}_{ij}-\mathbf{\bar{x}}_{i})^{T}$$

However, I have seen two different definitions for $B$. The first one, as described in Hardle et al., Applied Multivariate Statistical Analysis, 2003; Neil H. Timm, Applied Multivariate Analysis, 2002; and others, is:
$$B = \sum_{i=1}^{k}N_{i}(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}})(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}})^{T}$$

Here, $\mathbf{\bar{x}}$ is the overall mean:
$$\mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{N_{i}}\mathbf{x}_{ij}=\frac{1}{N}\sum_{i=1}^{k} N_{i}\mathbf{\bar{x}}_{i},$$ with $N=\sum_{i=1}^{k}N_{i}.$

The second one, as described in: Richard A. Johnson, Dean W. Wichern, Applied Multivariate Statistical Analysis 6th Edition, 2007; the Wikipedia article on LDA; the Scholarpedia article; and others, is:
$$B^{*} = \sum_{i=1}^{k}(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}^{*}})(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}^{*}})^{T}$$
This time, $\mathbf{\bar{x}^{*}}$ is the mean of the means of the classes:
$$\mathbf{\bar{x}^{*}} = \frac{1}{k}\sum_{i=1}^{k} \mathbf{\bar{x}}_{i}$$
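To make the two definitions concrete, here is how $W$, $B$, and $B^{*}$ might be computed with NumPy (a sketch under my own conventions: `X` is a list of per-class arrays of shape $(N_i, d)$; none of these names come from the cited books).

```python
import numpy as np

def scatter_matrices(X):
    """Compute W, B (weighted) and B* (unweighted) from a list of
    per-class arrays X[i] of shape (N_i, d)."""
    class_means = np.array([Xi.mean(axis=0) for Xi in X])   # (k, d)
    sizes = np.array([len(Xi) for Xi in X])                 # (k,)
    grand_mean = (sizes[:, None] * class_means).sum(axis=0) / sizes.sum()
    mean_of_means = class_means.mean(axis=0)

    d = X[0].shape[1]
    W = np.zeros((d, d))
    for Xi, m in zip(X, class_means):
        C = Xi - m
        W += C.T @ C                                         # within-class scatter

    dev = class_means - grand_mean
    B = (sizes[:, None] * dev).T @ dev                       # weighted by N_i

    dev_star = class_means - mean_of_means
    B_star = dev_star.T @ dev_star                           # unweighted version

    return W, B, B_star
```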

I have worked out that both versions of $B$ are (up to normalization) sample covariance formulas for the class means ($B^{*}$ is the standard one; for $B$, see the Wikipedia article on weighted covariance). Now, I wonder:

  • Does anyone know the reason for the discrepancy between the formulas?

  • Which formula is "better"?

  • The two formulas should be "equivalent" in some sense; but in what sense precisely?

Best Answer

Within- and between-class scatter matrices in LDA are direct multivariate generalizations of the within- and between-class sums of squares in ANOVA. So let us consider those. The idea is to decompose the total sum of squares into two parts.

Let $x_{ij}$ be the $j$-th data point from the $i$-th class, which contains $n_i$ data points. The total and within-class sums of squares are given by the obvious expressions:

\begin{equation} T = \sum_{ij} (x_{ij} - \bar x)^2 \\ W = \sum_{ij} (x_{ij} - \bar x_i)^2 \end{equation}

Let us now derive the expression for the between-class sum of squares: \begin{equation} x_{ij} - \bar x = (\bar x_i - \bar x) + (x_{ij} - \bar x_i) \\ (x_{ij} - \bar x)^2 = (\bar x_i - \bar x)^2 + (x_{ij} - \bar x_i)^2 + 2(\bar x_i - \bar x)(x_{ij} - \bar x_i) \\ \sum_{ij}(x_{ij} - \bar x)^2 = \sum_{ij}(\bar x_i - \bar x)^2 + \sum_{ij}(x_{ij} - \bar x_i)^2 + 2\sum_i\left[(\bar x_i - \bar x)\sum_j(x_{ij} - \bar x_i)\right] \\ T = \sum_i n_i (\bar x_i - \bar x)^2 + W \end{equation}

The cross term vanishes because $\sum_j(x_{ij} - \bar x_i)=0$ by the definition of the class mean, and so we see that a reasonable definition for the between-class sum of squares is $$B = \sum_i n_i (\bar x_i - \bar x)^2,$$ so that $T=B+W$.

The generalization to the multivariate case is straightforward: replace every squared difference $(a-b)^2$ by the outer product $(\mathbf a - \mathbf b)(\mathbf a - \mathbf b)^\top$, and that's it. So the correct expression for LDA is your first formula.
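As a quick numerical sanity check of the multivariate decomposition $\mathbf T = \mathbf B + \mathbf W$ (a sketch with synthetic data; the class sizes and means are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Three classes of unequal size, each shifted to a different mean.
X = [rng.normal(loc=mu, size=(n, d))
     for n, mu in zip((50, 30, 5), (0.0, 1.0, 3.0))]

means = [Xi.mean(axis=0) for Xi in X]
grand = np.vstack(X).mean(axis=0)

T = sum((Xi - grand).T @ (Xi - grand) for Xi in X)
W = sum((Xi - m).T @ (Xi - m) for Xi, m in zip(X, means))
B = sum(len(Xi) * np.outer(m - grand, m - grand) for Xi, m in zip(X, means))

assert np.allclose(T, B + W)   # the decomposition T = B + W holds
```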

As I said above in the comments, I cannot imagine any justification for the alternative formula (what you called $B^*$). In all the machine learning textbooks I know, the standard formula is always used. See e.g. Bishop's "Pattern Recognition and Machine Learning".

Update

I think I realized when the alternative formula might make sense. If the classes are very different in size, then the between-class scatter matrix $$\mathbf B=\sum_i n_i(\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top$$ will be dominated by the large classes. Imagine three classes with large $n_1$ and $n_2$, and small $n_3$. Then $\mathbf B$ will be hardly influenced by the third class at all, hence LDA will be looking for projections separating the first two classes but will not care much about how well the third class is separated. This is not always desired.

One might choose to "re-balance" such an unbalanced case and define $$\mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top,$$ where $\bar{ \mathbf x}^*$ is the mean of class means and $\bar n = \sum n_i / k$ is the mean number of points per class. This puts all classes on equal footing independent of their size, and might result in more meaningful projections.

Note that this will violate the decomposition of the sum of squares: $\mathbf T = \mathbf B + \mathbf W \ne \mathbf B^* + \mathbf W$, but this can be regarded as no big deal. However, the identity can be restored if the within-class and total scatter matrices are also defined in a "balanced" way:

\begin{equation} \mathbf T^* = \bar n \sum_{i} \frac{1}{n_i} \sum_j (\mathbf x_{ij} - \bar{\mathbf x}^*)(\mathbf x_{ij} - \bar{\mathbf x}^*)^\top \\ \mathbf W^* = \bar n \sum_{i}\frac{1}{n_i}\sum_j (\mathbf x_{ij} - \bar{ \mathbf x}_i)(\mathbf x_{ij} - \bar{\mathbf x}_i)^\top \\ \mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top. \end{equation}

If all $n_i$ are equal, these formulas will coincide with the standard ones.
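Here is a small numerical check that the balanced identity $\mathbf T^* = \mathbf B^* + \mathbf W^*$ indeed holds (a sketch with synthetic, unequal-size classes; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
X = [rng.normal(loc=mu, size=(n, d))
     for n, mu in zip((40, 25, 4), (0.0, 2.0, 4.0))]

k = len(X)
n_bar = sum(len(Xi) for Xi in X) / k                 # mean class size
means = [Xi.mean(axis=0) for Xi in X]
mean_of_means = np.mean(means, axis=0)               # x-bar-star

T_star = n_bar * sum(((Xi - mean_of_means).T @ (Xi - mean_of_means)) / len(Xi)
                     for Xi in X)
W_star = n_bar * sum(((Xi - m).T @ (Xi - m)) / len(Xi)
                     for Xi, m in zip(X, means))
B_star = n_bar * sum(np.outer(m - mean_of_means, m - mean_of_means)
                     for m in means)

assert np.allclose(T_star, B_star + W_star)          # balanced identity holds
```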