Discriminant Analysis – Deriving Total (Within Class + Between Class) Scatter Matrix


I was fiddling with PCA and LDA and got stuck at one point. I have a feeling the answer is so simple that I just can't see it.

Within-class ($S_W$) and between-class ($S_B$) scatter matrices are defined as:

$$
S_W = \sum_{i=1}^C\sum_{t=1}^N(x_t^i - \mu_i)(x_t^i - \mu_i)^T
$$

$$
S_B = \sum_{i=1}^CN(\mu_i-\mu)(\mu_i-\mu)^T
$$

Total scatter matrix $S_T$ is given as:

$$
S_T = \sum_{i=1}^C\sum_{t=1}^N(x_t^i - \mu)(x_t^i - \mu)^T = S_W + S_B
$$

where $C$ is the number of classes, $N$ is the number of samples per class, $x_t^i$ is the $t$-th sample of class $i$, $\mu_i$ is the mean of class $i$, and $\mu$ is the overall mean.

While trying to derive $S_T$, I reached a point where I had:

$$
(x-\mu_i)(\mu_i-\mu)^T + (\mu_i-\mu)(x-\mu_i)^T
$$

as a term. This needs to be zero, but why?

Indeed:

\begin{align}
S_T &= \sum_{i=1}^C\sum_{t=1}^N(x_t^i - \mu)(x_t^i - \mu)^T \\
&= \sum_{i=1}^C\sum_{t=1}^N(x_t^i - \mu_i + \mu_i - \mu)(x_t^i - \mu_i + \mu_i - \mu)^T \\
&= S_W + S_B + \sum_{i=1}^C\sum_{t=1}^N\big[(x_t^i - \mu_i)(\mu_i - \mu)^T + (\mu_i - \mu)(x_t^i - \mu_i)^T\big]
\end{align}

Best Answer

If you assume

$$\frac{1}{N}\sum_{t=1}^Nx_t^{i}=\mu_i$$

Then

$$\sum_{i=1}^C\sum_{t=1}^N(x_t^i-\mu_i)(\mu_i-\mu)^T=\sum_{i=1}^C\left(\sum_{t=1}^N(x_t^i-\mu_i)\right)(\mu_i-\mu)^T=0$$

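and the formula holds, because for every class the inner sum is $\sum_{t=1}^N x_t^i - N\mu_i = N\mu_i - N\mu_i = 0$. The second cross term is the transpose of the first, so it vanishes in the same way.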
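As a quick numerical sanity check of the decomposition, here is a minimal NumPy sketch. The data are random, and the sizes `C`, `N`, `d` and the helper `scatter` are arbitrary choices made for illustration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, d = 3, 50, 4  # classes, samples per class, feature dimension (arbitrary)

# X[i, t] is sample t of class i; each class is shifted by its own random offset
X = rng.normal(size=(C, N, d)) + rng.normal(scale=3.0, size=(C, 1, d))

mu_i = X.mean(axis=1)               # class means, shape (C, d)
mu = X.reshape(-1, d).mean(axis=0)  # overall mean, shape (d,)


def scatter(D):
    """Sum of outer products of the rows of D."""
    return D.T @ D


S_W = sum(scatter(X[i] - mu_i[i]) for i in range(C))
S_B = sum(N * np.outer(mu_i[i] - mu, mu_i[i] - mu) for i in range(C))
S_T = scatter(X.reshape(-1, d) - mu)

# First cross term: the within-class deviations sum to zero for every class
cross = sum(np.outer((X[i] - mu_i[i]).sum(axis=0), mu_i[i] - mu) for i in range(C))

print(np.allclose(S_T, S_W + S_B))
print(np.abs(cross).max())
```

The cross term is zero only up to floating-point rounding, which is why the check uses a tolerance rather than exact equality.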

Related Solutions

Python’s scikit-learn LDA Issues – How It Computes LDA via SVD

Update: Thanks to this discussion, scikit-learn was updated and now works correctly. Its LDA source code can be found here. The original issue was due to a minor bug (see this GitHub discussion), and my answer did not actually point at it correctly (apologies for any confusion caused). As none of that matters anymore (the bug is fixed), I edited my answer to focus on how LDA can be solved via SVD, which is the default algorithm in scikit-learn.

After defining within- and between-class scatter matrices $\boldsymbol \Sigma_W$ and $\boldsymbol \Sigma_B$, the standard LDA calculation, as pointed out in your question, is to take eigenvectors of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$ as discriminant axes (see e.g. here). The same axes, however, can be computed in a slightly different way, exploiting a whitening matrix:

1. Compute $\boldsymbol \Sigma_W^{-1/2}$. This is a whitening transformation with respect to the pooled within-class covariance (see my linked answer for details).

   Note that if you have the eigen-decomposition $\boldsymbol \Sigma_W = \mathbf{U}\mathbf{S}\mathbf{U}^\top$, then $\boldsymbol \Sigma_W^{-1/2}=\mathbf{U}\mathbf{S}^{-1/2}\mathbf{U}^\top$. Note also that one can compute the same thing by doing an SVD of the pooled within-class data: $\mathbf{X}_W = \mathbf{U} \mathbf{L} \mathbf{V}^\top \Rightarrow \boldsymbol\Sigma_W^{-1/2}=\mathbf{U}\mathbf{L}^{-1}\mathbf{U}^\top$.

2. Find the eigenvectors of $\boldsymbol \Sigma_W^{-1/2} \boldsymbol \Sigma_B \boldsymbol \Sigma_W^{-1/2}$; let us call them $\mathbf{A}^*$.

   Again, note that one can compute them by doing an SVD of the between-class data $\mathbf{X}_B$, transformed with $\boldsymbol \Sigma_W^{-1/2}$, i.e. between-class data whitened with respect to the within-class covariance.

3. The discriminant axes $\mathbf A$ will be given by $\boldsymbol \Sigma_W^{-1/2} \mathbf{A}^*$, i.e. by the principal axes of the transformed data, transformed again.

Indeed, if $\mathbf a^*$ is an eigenvector of the above matrix, then $$\boldsymbol \Sigma_W^{-1/2} \boldsymbol \Sigma_B \boldsymbol \Sigma_W^{-1/2}\mathbf a^* = \lambda \mathbf a^*,$$ and multiplying from the left by $\boldsymbol \Sigma_W^{-1/2}$ and defining $\mathbf a = \boldsymbol \Sigma_W^{-1/2}\mathbf a^*$, we immediately obtain: $$\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B \mathbf a = \lambda \mathbf a.$$

In summary, LDA is equivalent to whitening the matrix of class means with respect to within-class covariance, doing PCA on the class means, and back-transforming the resulting principal axes into the original (unwhitened) space.

This is pointed out e.g. in The Elements of Statistical Learning, section 4.3.3. In scikit-learn this is the default way to compute LDA because SVD of a data matrix is numerically more stable than eigen-decomposition of its covariance matrix.

Note that one can use any whitening transformation instead of $\boldsymbol \Sigma_W^{-1/2}$ and everything will still work exactly the same. In scikit-learn $\mathbf{L}^{-1}\mathbf{U}^\top$ is used (instead of $\mathbf{U}\mathbf{L}^{-1}\mathbf{U}^\top$), and it works just fine (contrary to what was originally written in my answer).
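The whitening route described above can be sketched in NumPy as follows. This is an illustration on random data, not scikit-learn's actual implementation: it uses scatter matrices in place of covariance matrices (they differ only by scalar factors, which do not change the axes), it relies on `np.linalg.eigh` rather than SVD, and all variable names are made up for the example. For a general whitening matrix $\mathbf W$ with $\mathbf W \boldsymbol\Sigma_W \mathbf W^\top = \mathbf I$, the sketch assumes the back-transformation is $\mathbf W^\top \mathbf A^*$, which reduces to the formula above when $\mathbf W$ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
C, N, d = 3, 40, 4  # classes, samples per class, feature dimension (arbitrary)
X = rng.normal(size=(C, N, d)) + rng.normal(scale=2.0, size=(C, 1, d))

mu_i = X.mean(axis=1)   # class means, shape (C, d)
mu = mu_i.mean(axis=0)  # overall mean (classes are balanced here)

Xc = (X - mu_i[:, None, :]).reshape(-1, d)  # data centered within each class
S_W = Xc.T @ Xc                             # within-class scatter
M = mu_i - mu
S_B = N * M.T @ M                           # between-class scatter

# Step 1: symmetric whitening matrix S_W^{-1/2} from the eigen-decomposition
s, U = np.linalg.eigh(S_W)
W_sym = U @ np.diag(s ** -0.5) @ U.T

# Step 2: eigenvectors of the whitened between-class scatter
lam, A_star = np.linalg.eigh(W_sym @ S_B @ W_sym)

# Step 3: back-transform into the original space
A = W_sym @ A_star

# Each column of A should satisfy S_W^{-1} S_B a = lambda a
target = np.linalg.solve(S_W, S_B)
residual = target @ A - A * lam

# A different whitening matrix, S^{-1/2} U^T, gives the same eigenvalues
W_alt = np.diag(s ** -0.5) @ U.T
lam_alt, B_star = np.linalg.eigh(W_alt @ S_B @ W_alt.T)
A_alt = W_alt.T @ B_star
residual_alt = target @ A_alt - A_alt * lam_alt

print(np.abs(residual).max(), np.abs(residual_alt).max())
```

Checking the residual of the eigen-equation, rather than comparing against `np.linalg.eig` of the non-symmetric product, avoids having to match eigenvector signs and ordering.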

Solved – Comparing four formulations of class scatter matrices

Let's go over your four definitions one by one.

1. Duda et al. 2012. These are the standard definitions of scatter matrices: within-class, between-class, and the total scatter matrix. They obey a nice and useful property $$S_W+S_B=S_T,$$ so one can talk about the "decomposition of the scatter matrix" similar to the "decomposition of the sum of squares" in a univariate situation (one-way ANOVA). For the purposes of linear discriminant analysis (LDA), one only needs the product $S_W^{-1}S_B$.

   A scatter matrix differs from a covariance matrix only by a scalar multiplier: the sample covariance matrix is equal to the scatter matrix divided by $n$ (for the maximum likelihood estimate) or by $n-1$ (for the unbiased estimate).

2. Webb 2002. These definitions differ from (1) only by the $1/n$ factor; otherwise they are identical. It follows that the product $S_W^{-1}S_B$ computed using these definitions will be identical to the one from (1), and so definitions (1) and (2) are equivalent as far as LDA is concerned.

   Of course Webb's $S_T$ is just the sample covariance matrix (ML estimate), so one might think that these definitions simply replace scatter matrices with covariance matrices. But the situation is tricky here because the between-class covariance matrix is usually estimated with a $C-1$ denominator (instead of $n$) and the within-class covariance matrix with an $n-C$ denominator: these are the respective degrees of freedom. If one uses these factors, then the decomposition of the total covariance matrix into between-class and within-class covariance matrices does not hold (and using the same factor does not make much sense). This is why it is easier to work with scatter matrices instead of covariance matrices and to side-step these problems.

   The reason Webb 2002 uses the $1/n$ factor is probably so that his $S_T$ is equal to the total covariance matrix, which is a very familiar object. However, if Webb uses the $1/n$ factor and still calls them "scatter matrices", then it is very non-standard terminology.

3. Johnson and Wichern 2007. This is a non-standard definition of the between-class scatter matrix (the within-class one is the same here as in (1)), and the authors do not seem to motivate it in their textbook. So I can only guess at the rationale behind it; see my answer to What is the correct formula for between-class scatter matrix in LDA?. As I wrote there, this approach can actually be useful when the classes are unbalanced (different number of data points per class). One can call this a "re-balanced between-class scatter matrix".

4. @amoeba 2015. The between-class scatter matrix from (3) does not turn into the standard between-class scatter matrix from (1) when all $n_i$ are equal to each other. There is a scalar factor $\bar n$ missing.

   Another problem with (3) is that there is no meaningful definition of the total scatter matrix preserving the decomposition equation $S_W+S_B=S_T$. Definitions (4) were my attempt to suggest a set of definitions for re-balanced scatter matrices so that they (i) preserve the decomposition property and (ii) reduce to the standard definitions (1) when classes are balanced.

   The idea is that maybe you have unequal $n_i$ due to some experimental limitations, but you would still like to have a guess at what would happen if the $n$'s were equal (perhaps you expect them to be equal in the test dataset or in the future). So even for the within-class covariance matrix you want to weigh the contribution of each class equally.

No, I have never seen it described in the literature.
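The remarks about scalar factors in (1) and (2) can be checked numerically. The sketch below is an illustration on random data with arbitrary, unbalanced class sizes; it assumes the standard definitions with $n_i$ weights in $S_B$, and every variable name is invented for the example. It confirms that dividing the total scatter by $n$ or $n-1$ gives the usual covariance estimates, that a common $1/n$ factor keeps both the decomposition and the product $S_W^{-1}S_B$ intact, and that the degrees-of-freedom denominators $n-C$ and $C-1$ break the decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [20, 35, 50]  # class sizes n_i (arbitrary, unbalanced)
d = 3
classes = [rng.normal(loc=rng.normal(scale=2.0, size=d), size=(n_i, d))
           for n_i in sizes]

X = np.vstack(classes)
n, C = len(X), len(classes)
mu = X.mean(axis=0)
means = [Xi.mean(axis=0) for Xi in classes]

S_W = sum((Xi - m).T @ (Xi - m) for Xi, m in zip(classes, means))
S_B = sum(len(Xi) * np.outer(m - mu, m - mu) for Xi, m in zip(classes, means))
S_T = (X - mu).T @ (X - mu)

# Scatter matrix vs covariance matrix: only the scalar factor differs
ml_matches = np.allclose(S_T / n, np.cov(X, rowvar=False, bias=True))
unbiased_matches = np.allclose(S_T / (n - 1), np.cov(X, rowvar=False))

# A common 1/n factor preserves the decomposition and the LDA product
same_factor_ok = np.allclose(S_W / n + S_B / n, S_T / n)
product_unchanged = np.allclose(np.linalg.solve(S_W / n, S_B / n),
                                np.linalg.solve(S_W, S_B))

# Degrees-of-freedom denominators do not add up to the total covariance
dof_ok = np.allclose(S_W / (n - C) + S_B / (C - 1), S_T / (n - 1))

print(ml_matches, unbiased_matches, same_factor_ok, product_unchanged, dof_ok)
```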
