Here is what Hastie et al. have to say about it (in the context of two-class LDA) in The Elements of Statistical Learning, section 4.3:
> Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.
I don't fully understand the derivation via least squares they refer to [Update: I am going to summarize it briefly at some point], but in general I think that this paragraph makes sense: even if the data are very non-Gaussian or the class covariances are very different, the LDA axis will probably still yield some discriminability. However, the cut-point on this axis (separating the two classes) given by LDA can be completely off, and optimizing it separately can substantially improve classification.
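For illustration, here is a minimal sketch of what choosing the cut-point empirically could look like; this is my own example (not from Hastie et al.), assuming a two-class subset of the built-in iris data and `MASS::lda`:

```r
library(MASS)

# Two-class problem: versicolor vs virginica from the built-in iris data
d   <- droplevels(iris[iris$Species != "setosa", ])
fit <- lda(Species ~ ., data = d)

# Scores on the (single) LDA axis
scores <- predict(fit)$x[, 1]

# Candidate cut-points: midpoints between consecutive sorted scores
s    <- sort(scores)
cand <- (head(s, -1) + tail(s, -1)) / 2

# Training error for each cut-point (allowing either orientation of the axis)
err <- sapply(cand, function(cp) {
  pred <- ifelse(scores > cp, levels(d$Species)[2], levels(d$Species)[1])
  min(mean(pred != d$Species), mean(pred == d$Species))
})

best_cut <- cand[which.min(err)]
best_cut
min(err)   # empirical training error at the optimized cut-point
```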
Notice that this refers to the classification performance only. If all you are after is dimensionality reduction, then the LDA axis is all you need. So my guess is that for dimensionality reduction LDA will often do a decent job even if the assumptions are violated.
Regarding rLDA and QDA: regularized LDA (rLDA) has to be used when there are not enough data points to reliably estimate the within-class covariance, and it is vital in that case. QDA, on the other hand, is a non-linear method, so I am not sure how to use it for dimensionality reduction.
Within- and between-class scatter matrices in LDA are direct multivariate generalizations of the within- and between-class sums of squares in ANOVA. So let us consider those. The idea is to decompose the total sum of squares into two parts.
Let $x_{ij}$ be a $j$-th data point from the $i$-th class with $n_i$ data points. Total sum of squares and within-class sum of squares are given by the obvious expressions:
\begin{equation}
T = \sum_{ij} (x_{ij} - \bar x)^2 \\
W = \sum_{ij} (x_{ij} - \bar x_i)^2
\end{equation}
Let us now derive the expression for the between-class sum of squares:
\begin{equation}
x_{ij} - \bar x = (\bar x_i - \bar x) + (x_{ij} - \bar x_i) \\
(x_{ij} - \bar x)^2 = (\bar x_i - \bar x)^2 + (x_{ij} - \bar x_i)^2 + 2(\bar x_i - \bar x)(x_{ij} - \bar x_i) \\
\sum_{ij}(x_{ij} - \bar x)^2 = \sum_{ij}(\bar x_i - \bar x)^2 + \sum_{ij}(x_{ij} - \bar x_i)^2 + 2\sum_i\left[(\bar x_i - \bar x)\sum_j(x_{ij} - \bar x_i)\right] \\
T = \sum_i n_i (\bar x_i - \bar x)^2 + W
\end{equation}
The cross term vanishes because $\sum_j (x_{ij} - \bar x_i) = 0$ by the definition of the class mean, and so we see that a reasonable definition for the between-class sum of squares is
$$B = \sum_i n_i (\bar x_i - \bar x)^2,$$
so that $T=B+W$.
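As a quick numerical sanity check of this decomposition, here is a small R sketch (my own example on one feature of the built-in iris data, not part of the original argument):

```r
x <- iris$Sepal.Length
g <- iris$Species

T_tot <- sum((x - mean(x))^2)                               # total sum of squares
W     <- sum(ave(x, g, FUN = function(v) (v - mean(v))^2))  # within-class sum of squares
B     <- sum(table(g) * (tapply(x, g, mean) - mean(x))^2)   # between-class sum of squares

all.equal(T_tot, B + W)   # TRUE
```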
The generalization to the multivariate case is straightforward: replace each squared deviation $(x - \bar x)^2$ by the outer product $(\mathbf x - \bar{\mathbf x})(\mathbf x - \bar{\mathbf x})^\top$, and that's it. So the correct expression for LDA is your first formula.
As I said above in the comments, I cannot imagine any justification for the alternative formula (what you called $B^*$). In all the machine learning textbooks I know, the standard formula is always used. See e.g. Bishop's "Pattern Recognition and Machine Learning".
Update
I think I realized when the alternative formula might make sense. If the classes are very different in size, then the between-class scatter matrix $$\mathbf B=\sum_i n_i(\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top$$ will be dominated by the large classes. Imagine three classes with large $n_1$ and $n_2$, and small $n_3$. Then $\mathbf B$ will hardly be influenced by the third class at all, hence LDA will be looking for projections separating the first two classes but will not care much about how well the third class is separated. This is not always desired.
One might choose to "re-balance" such an unbalanced case and define $$\mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top,$$ where $\bar{ \mathbf x}^*$ is the mean of class means and $\bar n = \sum n_i / k$ is the mean number of points per class. This puts all classes on equal footing independent of their size, and might result in more meaningful projections.
Note that this will violate the decomposition of the sum of squares: $\mathbf T = \mathbf B + \mathbf W \ne \mathbf B^* + \mathbf W$, but this can be regarded as no big deal. However, the identity can be restored if the within-class and total scatter matrix are also defined in a "balanced" way:
\begin{equation}
\mathbf T^* = \bar n \sum_{i} \frac{1}{n_i} \sum_j (\mathbf x_{ij} - \bar{\mathbf x}^*)(\mathbf x_{ij} - \bar{\mathbf x}^*)^\top \\
\mathbf W^* = \bar n \sum_{i}\frac{1}{n_i}\sum_j (\mathbf x_{ij} - \bar{ \mathbf x}_i)(\mathbf x_{ij} - \bar{\mathbf x}_i)^\top \\
\mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top.
\end{equation}
If all $n_i$ are equal, these formulas will coincide with the standard ones.
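Here is a small R sketch (my own illustration, not part of the original text) that computes the balanced scatter matrices on an unbalanced subset of iris and checks $\mathbf T^* = \mathbf B^* + \mathbf W^*$ numerically:

```r
# Unbalanced subset of iris: 50 setosa, 30 versicolor, 15 virginica
d <- iris[c(1:50, 51:80, 101:115), ]
X <- as.matrix(d[, 1:4])
g <- droplevels(d$Species)

n_i     <- as.numeric(table(g))
n_bar   <- mean(n_i)                          # mean number of points per class
mu_i    <- t(sapply(levels(g), function(l) colMeans(X[g == l, , drop = FALSE])))
mu_star <- colMeans(mu_i)                     # mean of class means

# Sum of outer products of the rows of M around a given center
scatter <- function(M, center) crossprod(sweep(M, 2, center))

W_star <- Reduce(`+`, lapply(seq_along(n_i), function(i)
  (n_bar / n_i[i]) * scatter(X[g == levels(g)[i], , drop = FALSE], mu_i[i, ])))
T_star <- Reduce(`+`, lapply(seq_along(n_i), function(i)
  (n_bar / n_i[i]) * scatter(X[g == levels(g)[i], , drop = FALSE], mu_star)))
B_star <- n_bar * scatter(mu_i, mu_star)

all.equal(T_star, W_star + B_star)            # TRUE
```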
Best Answer
The question is quite old, and I'm surprised that there has been no attempt to answer it. So, I will try.
Well, theoretically you are completely right, but in practice there are a few reasons why you see this done very rarely.
So, how do you know? Well, my personal recommendation: think about the physical meaning of your data. What are the groups you are investigating? Is there a high chance (or some reason to expect) a significant difference in the groups' covariances?
Example 1. Let's look at the `iris` dataset and check Box's M test of covariance homogeneity.
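One way to run this test in R (a reconstruction on my part, assuming the `boxM` function from the biotools package; the original answer's code and output are not shown here):

```r
# Box's M test of homogeneity of the within-group covariance matrices
library(biotools)
boxM(iris[, 1:4], iris$Species)
```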
The null hypothesis of covariance equality is rejected, i.e. the assumption is violated.
But let's take a look at the data:
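One possible way to reproduce such a view (my reconstruction; the original figure is not shown here):

```r
# Pairwise scatter plots of the four features, coloured by species,
# so the spread within each group can be compared visually
pairs(iris[, 1:4], col = iris$Species, pch = 19)
```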
We can see visually that the variances are not the same in the three groups. But do they differ significantly? It doesn't seem so. Moreover, let's think about what data we are analyzing: petals of closely related species of flowers. So it's quite natural to assume that the covariance within the groups is somewhat the same.
Compare LDA and QDA
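A sketch of such a comparison using `MASS` (my reconstruction; the original code and decision-boundary plots are not shown here):

```r
library(MASS)

fit_lda <- lda(Species ~ ., data = iris)
fit_qda <- qda(Species ~ ., data = iris)

# Training confusion tables
table(observed = iris$Species, lda = predict(fit_lda)$class)
table(observed = iris$Species, qda = predict(fit_qda)$class)
```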
As one can see, the separation quality is almost the same for LDA and QDA, but the QDA model seems overfitted because of its unnecessarily complex decision boundaries.
Example 2. Let's simulate a more significant violation of the covariance assumption: multiplying the values of one group by 2 increases its variance by a factor of 4.
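A sketch of such a simulation (my reconstruction: I scale the virginica group in the two-class subset by 2; the original answer does not specify the exact setup):

```r
library(MASS)

# Two-class subset, with one group's values multiplied by 2
d2 <- droplevels(iris[iris$Species != "setosa", ])
d2[d2$Species == "virginica", 1:4] <- 2 * d2[d2$Species == "virginica", 1:4]

fit_lda2 <- lda(Species ~ ., data = d2)
fit_qda2 <- qda(Species ~ ., data = d2)

mean(predict(fit_lda2)$class == d2$Species)   # LDA training accuracy
mean(predict(fit_qda2)$class == d2$Species)   # QDA training accuracy

# Projection on the first two principal components
plot(prcomp(d2[, 1:4])$x[, 1:2], col = d2$Species, pch = 19)
```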
Theoretically, we would reject LDA and use QDA. But projecting onto the principal components we see that the two classes can still be separated linearly.
We see that both models work well despite the assumption violation. But we would choose the linear model, since it is more interpretable and less likely to overfit.
Example 3. Now suppose we want to discriminate only `Versicolor`, i.e. our target now consists of two classes: `Versicolor` and `not Versicolor`. Now it is reasonable to assume that the covariances are different, because the covariance of the `not Versicolor` group contains the between-group variance as well. The normality assumption is violated significantly too: compare the variance of the red points and the blue points; the difference is quite significant now, right? Theoretically, we would refuse both QDA and LDA, since there is neither normality nor equal covariance. But let's see how the methods would work:
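A sketch of this two-class setup (my reconstruction, not the original code):

```r
library(MASS)

d3 <- iris[, 1:4]
d3$Target <- factor(ifelse(iris$Species == "versicolor", "Versicolor", "not Versicolor"))

fit_lda3 <- lda(Target ~ ., data = d3)
fit_qda3 <- qda(Target ~ ., data = d3)

mean(predict(fit_lda3)$class == d3$Target)   # LDA training accuracy
mean(predict(fit_qda3)$class == d3$Target)   # QDA training accuracy
```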
Well, LDA indeed is not suitable. However, the QDA model is not that bad.
So in general, I would recommend:
P.S.
I hope it's clear from my explanation that `pairs` is the wrong tool here: `pairs` shows only the feature-pairwise variance, but the assumption is about the covariance within groups, i.e. for `iris` we assume that `cov(iris[iris$Species == 'setosa', -5])`, `cov(iris[iris$Species == 'versicolor', -5])`, and `cov(iris[iris$Species == 'virginica', -5])` are equal.