Solved – Sources’ seeming disagreement on linear, quadratic and Fisher’s discriminant analysis

discriminant analysis, multivariate analysis

I'm studying discriminant analysis, but I'm having a difficult time reconciling several different explanations. I believe I must be missing something, because I've never encountered this (seeming) level of discrepancy before. That being said, the number of questions about discriminant analysis on this website seems to be a testament to its complexity.

LDA and QDA for several classes

My main textbook is Johnson & Wichern's Applied Multivariate Statistical Analysis (AMSA), together with my teacher's notes based on it. I'll ignore the two-group setting, because I believe the simplified formulas in that setting are causing at least some of the confusion. According to this source, LDA and QDA are defined as a parametric (assuming multivariate normality) extension of a classification rule based on the expected cost of misclassification (ECM). The ECM sums, over the groups, the conditional expected cost of classifying a new observation x into the wrong group (incorporating misclassification costs and prior probabilities), and we choose classification regions that minimize it. $$ECM = \sum_{i=1}^{g} p_i \left[\sum_{k=1,\, k \ne i}^{g} P(k|i)\, c(k|i)\right]$$ where $P(k|i) = P(\text{classifying item as group } k \mid \text{item is group } i) = \int_{R_k} f_i(\boldsymbol{x})\, d\boldsymbol{x}$, $f_i(\boldsymbol{x})$ is the population density, $R_k$ is the classification region for group $k$, $c(k|i)$ is the misclassification cost, $p_i$ are the prior probabilities, and $g$ is the number of groups.
New observations can then be assigned to the group for which the inner term is smallest, or equivalently to the group for which the part left out of the inner term, $p_k f_k(\boldsymbol{x})$, is largest.
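To check that I'm reading this rule correctly, here is how I would write it out in R with toy numbers of my own (equal misclassification costs assumed), simply picking the group where $p_k f_k(\boldsymbol{x})$ is largest:

```r
# Toy sketch of the "largest p_k * f_k(x)" allocation rule with equal costs,
# using three univariate normal populations (numbers are made up)
x      <- 1.2                      # a new observation
priors <- c(0.5, 0.3, 0.2)         # p_i
means  <- c(0, 2, 4)               # group means
sds    <- c(1, 1, 1.5)             # group standard deviations

scores <- priors * dnorm(x, mean = means, sd = sds)   # p_k * f_k(x) for each group
which.max(scores)                  # allocate x to the group where this is largest
```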

Supposedly this classification rule is equivalent to "one that maximizes the posterior probabilities" (sic AMSA), which I can only assume is the Bayes approach I've seen mentioned. Is this correct? And is ECM an older method? I've never seen it come up anywhere else.

For normal populations this rule simplifies to the quadratic discriminant score: $$d_i^Q(\boldsymbol{x}) = -\frac{1}{2} \log|\boldsymbol{\Sigma}_i| -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_i) + \log(p_i).$$
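Written out directly (my own sketch, not taken from the book), the score for each group would be something like the function below, with the new observation allocated to the group whose score is largest:

```r
# d_i^Q(x) written out directly; x, mu_i, Sigma_i and p_i are the observation,
# the group mean vector, the group covariance matrix and the prior probability
quad_score <- function(x, mu_i, Sigma_i, p_i) {
  d <- x - mu_i
  drop(-0.5 * log(det(Sigma_i)) - 0.5 * t(d) %*% solve(Sigma_i) %*% d + log(p_i))
}

# toy usage with two bivariate normal groups (made-up parameters)
x  <- c(1, 0.5)
s1 <- quad_score(x, mu_i = c(0, 0), Sigma_i = diag(2),       p_i = 0.6)
s2 <- quad_score(x, mu_i = c(2, 1), Sigma_i = 0.5 * diag(2), p_i = 0.4)
which.max(c(s1, s2))   # pick the group with the largest score
```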

This seems equivalent to The Elements of Statistical Learning (ESL) formula 4.12 on page 110, although they describe it as a quadratic discriminant function rather than a score. Moreover, they arrive here through the log-ratio of multivariate densities (4.9). Is this yet another name for Bayes' approach?

When we assume equal covariance the formula simplifies even further to the linear discriminant score.

$$d_i(\boldsymbol{x}) = \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1}\boldsymbol{x} -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \log(p_i)$$

This formula does differ from ESL (4.10), where the first term is reversed: $\boldsymbol{x}^T \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k$. The ESL version is also the one listed in Statistical Learning in R. Moreover, in the SAS output presented in AMSA, a linear discriminant function is described as consisting of a constant $-\frac{1}{2} \bar{\boldsymbol{X}}_j^T \text{COV}^{-1}\bar{\boldsymbol{X}}_j + \ln(\text{prior}_j)$ and a coefficient vector $\text{COV}^{-1}\bar{\boldsymbol{X}}_j$, seemingly consistent with the ESL version.
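In case it helps anyone reproduce what I'm looking at, here are the two versions of the first term evaluated with made-up numbers (they come out identical for me, which only adds to my confusion about why the books write them differently):

```r
# comparing mu_i' Sigma^{-1} x (AMSA) with x' Sigma^{-1} mu_i (ESL), toy numbers
set.seed(1)
p    <- 3
x    <- rnorm(p)
mu   <- rnorm(p)
Sig  <- crossprod(matrix(rnorm(p * p), p))   # a symmetric positive-definite stand-in for Sigma
Sinv <- solve(Sig)

drop(t(mu) %*% Sinv %*% x)   # AMSA ordering
drop(t(x) %*% Sinv %*% mu)   # ESL ordering
```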

What could be the reason behind this discrepancy?

Discriminants and Fisher's method

Note: if this question is deemed too large I will remove this section and open a new question, but it builds on the previous one. Apologies for the wall of text regardless; I tried my best to structure it somewhat, but I'm sure my confusion about this method has led to some rather odd jumps of logic.

The AMSA book goes on to describe Fisher's method, also for several groups. However, ttnphns has pointed out multiple times that FDA is simply LDA with two groups. What is this multiclass FDA then? Perhaps FDA can have multiple meanings?

AMSA describes Fisher's discriminants as the eigenvectors of $\boldsymbol{W}^{-1}\boldsymbol{B}$ that maximize the ratio $\frac{\hat{\boldsymbol{a}}^T\boldsymbol{B}\hat{\boldsymbol{a}}}{\hat{\boldsymbol{a}}^T\boldsymbol{W}\hat{\boldsymbol{a}}}$. The linear combinations $\hat{\boldsymbol{e}}_i^T\boldsymbol{x}$ are then the sample discriminants (of which there are $\min(g-1, p)$). For classification we choose the group $k$ with the smallest value of $\sum_{j=1}^{r}[\hat{\boldsymbol{e}}_j^T(\boldsymbol{x}-\bar{\boldsymbol{x}}_k)]^2$, where $r$ is the number of discriminants we would like to use. If we use all the discriminants, this rule is equivalent to the linear discriminant function.
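Here is my own rough R attempt at this construction on the iris data (so any mistakes are mine, not AMSA's): build $\boldsymbol{W}$ and $\boldsymbol{B}$ as the within- and between-group SSCP matrices, take the eigenvectors of $\boldsymbol{W}^{-1}\boldsymbol{B}$, and apply the classification rule above:

```r
# Rough sketch of the AMSA construction on iris (g = 3 groups, p = 4 variables)
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species

xbar_k <- rowsum(X, grp) / as.vector(table(grp))   # group mean vectors (g x p)
xbar   <- colMeans(X)                              # overall mean

# within-group and between-group SSCP matrices
W <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- scale(X[grp == k, ], center = TRUE, scale = FALSE)
  crossprod(Xk)
}))
B <- Reduce(`+`, lapply(levels(grp), function(k) {
  n_k <- sum(grp == k)
  d   <- xbar_k[k, ] - xbar
  n_k * tcrossprod(d)
}))

# eigenvectors of W^{-1} B; only the first min(g - 1, p) = 2 matter here
ev <- eigen(solve(W) %*% B)
e  <- Re(ev$vectors[, 1:2])          # coefficients of the sample discriminants

# classification: choose the group k minimizing the sum of squared
# discriminant-space deviations from that group's mean
x_new <- X[1, ]
dists <- apply(xbar_k, 1, function(m) sum((t(e) %*% (x_new - m))^2))
names(which.min(dists))
```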

Many explanations of LDA seem to describe the methodology that the AMSA book calls FDA, i.e. starting from this between/within variability aspect. What is then meant by FDA, if not the decomposition of the $\boldsymbol{B}$ and $\boldsymbol{W}$ matrices?

This is the first time the textbook mentions the dimension reduction aspect of discriminant analysis, whereas several answers on this site emphasize the two-stage nature of this technique; this is not apparent in a two-group setting because there is only one discriminant. Given the above formulas for multiclass LDA and QDA, it is still not apparent to me where the discriminants show up.

This comment especially left me confused, noting that Bayes classification could essentially be performed on the original variables. But if FDA and LDA are mathematically equivalent, as pointed out by the book and here, shouldn't the dimensionality reduction be inherent to the functions $d_i$? I believe this is what that last link addresses, but I'm not entirely sure.

My teacher's course notes go on to explain that FDA is essentially a form of canonical correlation analysis. I've found only one other source that discusses this aspect, but it once again seems closely tied to the Fisher approach of decomposing the between and within variability. SAS presents a result in its LDA/QDA procedure (DISCRIM) that is apparently related to Fisher's method (https://stats.stackexchange.com/a/105116/62518). However, SAS's FDA option (CANDISC) essentially performs a canonical correlation analysis, without presenting these so-called Fisher's classification coefficients. It does present raw canonical coefficients, which I believe are equivalent to the $\boldsymbol{W}^{-1}\boldsymbol{B}$ eigenvectors obtained by R's lda (MASS) (https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_candisc_sect019.htm). The classification coefficients seem to be obtained from the discriminant function I described in my LDA and QDA section (since there is one function per population and we choose the largest one).
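For reference, this is the R output I have been comparing against (MASS's lda on iris; the scaling matrix is what I mean by the raw/canonical coefficients):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
fit$scaling            # coefficients of LD1, LD2 (the canonical discriminants)
head(predict(fit)$x)   # discriminant scores for the first few observations
```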

I'd be grateful for any and all clarifications or references to sources that could help me see the forest for the trees. The main cause of my confusion seems to be that different textbooks call methods by different names or present slight variations of the mathematics without acknowledging the other possibilities, although I suppose this should not come as a surprise given the age of the AMSA book.

Best Answer

I'm addressing only one aspect of the question, and doing it intuitively, without algebra.

If the $g$ classes have the same variance-covariance matrices and differ only by the shifts of their centroids in the $p$-dimensional space, then they are completely linearly separable in a $q=\min(g-1,p)$-dimensional subspace. This is what LDA does. Imagine you have three identical ellipsoids in the space of variables $V_1, V_2, V_3$. You have to use the information from all the variables to predict class membership without error. But because these are identically sized and oriented clouds, it is possible to rescale them by a common transformation into balls of unit radius. Then $q=g-1=2$ independent dimensions suffice to predict class membership as precisely as before. These dimensions are called the discriminant functions $D_1, D_2$. Having 3 same-size balls of points, you need only 2 axial lines, plus the coordinates of the balls' centres on them, to assign every point correctly.
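Here is a small R illustration of that picture, using iris as a stand-in (my own toy code, equal priors assumed): sphere the data by the pooled within-class covariance, then assign each point to the nearest centroid in the sphered space.

```r
# "Rescale the clouds into balls, then use nearest centroid" on iris
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species
n   <- nrow(X); g <- nlevels(grp)

# pooled within-class covariance and a whitening (sphering) transform
Sp <- Reduce(`+`, lapply(levels(grp), function(k) {
  Xk <- scale(X[grp == k, ], center = TRUE, scale = FALSE)
  crossprod(Xk)
})) / (n - g)
Wt <- solve(chol(Sp))              # X %*% Wt has identity pooled within-class covariance

Z       <- X %*% Wt                                 # the "balls"
centres <- rowsum(Z, grp) / as.vector(table(grp))   # sphered class centroids
d2      <- sapply(levels(grp), function(k) rowSums(sweep(Z, 2, centres[k, ])^2))
pred    <- levels(grp)[max.col(-d2)]                # nearest sphered centroid
mean(pred == grp)                                   # resubstitution accuracy of the sketch
```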


Discriminants are uncorrelated variables; their within-class covariance matrices are ideally identity matrices (the balls). Discriminants form a subspace of the original variable space: they are linear combinations of the original variables. However, they are not rotation-like (PCA-like) axes: seen in the original variable space, the discriminant axes are not mutually orthogonal.

So, under the assumption of homogeneity of the within-class variance-covariance matrices, LDA using all the existing discriminants for classification is no worse than classifying directly by the original variables. But you don't have to use all the discriminants. You might use only the first $m<q$ strongest / most statistically significant ones. That way you lose minimal information for classification, and the increase in misclassification will be minimal. Seen from this perspective, LDA is a data reduction similar to PCA, only supervised.
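A quick sketch of that point with MASS's lda on iris (where $q = 2$): classify with only the first discriminant versus with both of them.

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
acc <- function(d) mean(predict(fit, dimen = d)$class == iris$Species)
c(one_discriminant = acc(1), both_discriminants = acc(2))   # resubstitution accuracy
```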

Note that, assuming homogeneity (+ multivariate normality), and provided that you plan to use all the discriminants in classification, it is possible to bypass the extraction of the discriminants themselves - which involves a generalized eigenproblem - and compute the so-called "Fisher's classification functions" directly from the variables, in order to classify with them, with the equivalent result. So, when the $g$ classes are identical in shape, we could consider the $p$ input variables, the $g$ Fisher's functions, or the $q$ discriminants as all equivalent sets of "classifiers". But discriminants are more convenient in many respects.$^1$
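For concreteness, here is one way those classification functions can be written directly from the variables in R (my own sketch on iris, equal priors assumed): one linear function per class, with coefficients $\mathbf{S}_{pooled}^{-1}\bar{\mathbf{x}}_j$ and constant $-\frac{1}{2}\bar{\mathbf{x}}_j^T\mathbf{S}_{pooled}^{-1}\bar{\mathbf{x}}_j + \ln p_j$, allocating each observation to the class whose function value is largest.

```r
# Fisher's classification functions computed directly from the variables
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species
lev <- levels(grp)
n   <- nrow(X); g <- length(lev)

means <- rowsum(X, grp) / as.vector(table(grp))    # class mean vectors (g x p)

# pooled within-class covariance and its inverse
Sp <- Reduce(`+`, lapply(lev, function(k) {
  Xk <- scale(X[grp == k, ], center = TRUE, scale = FALSE)
  crossprod(Xk)
})) / (n - g)
Spi <- solve(Sp)

prior <- rep(1 / g, g)                             # equal priors assumed
coefs <- Spi %*% t(means)                          # one coefficient column per class (p x g)
const <- -0.5 * colSums(t(means) * coefs) + log(prior)

scores <- X %*% coefs + matrix(const, n, g, byrow = TRUE)
pred   <- lev[max.col(scores)]                     # class with the largest function value
mean(pred == grp)                                  # resubstitution accuracy of the sketch
```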

Since in reality the classes are usually not "identical ellipsoids", classification by the $q$ discriminants is somewhat poorer than Bayes classification by all the $p$ original variables. For example, on this plot the two ellipsoids are not parallel to each other, and one can grasp visually that the single existing discriminant is not enough to classify points as accurately as the two variables allow. QDA (quadratic discriminant analysis) would then be a step better approximation than LDA. A practical approach halfway between LDA and QDA is to use the LDA discriminants but, at classification, use their observed separate-class covariance matrices (see, see) instead of their pooled matrix (which is the identity).
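One way to sketch that half-way approach in R (my own illustration, not a canned procedure): keep the LDA discriminants, then classify with separate per-class covariance matrices of those discriminants, i.e. run QDA on the discriminant scores.

```r
library(MASS)
fit  <- lda(Species ~ ., data = iris)
Z    <- predict(fit)$x                        # the LDA discriminant scores
half <- qda(Z, grouping = iris$Species)       # separate-class covariances in the discriminant space
mean(predict(half, Z)$class == iris$Species)  # resubstitution accuracy of the sketch
```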

(And yes, LDA can be seen as closely related to, even a specific case of, MANOVA and Canonical correlation analysis or Reduced rank multivariate regression - see, see, see.)


$^1$ An important terminological note. In some texts the $g$ Fisher's classification functions may be called "Fisher's discriminant functions", which can be confused with the $q$ discriminants, i.e. the canonical discriminant functions (obtained in the eigendecomposition of $\bf W^{-1}B$). For clarity, I recommend saying "Fisher's classification functions" vs. "canonical discriminant functions" (= discriminants, for short). In the modern understanding, LDA is the canonical linear discriminant analysis. "Fisher's discriminant analysis" is, at least to my awareness, either LDA with 2 classes (where the single canonical discriminant is inevitably the same thing as the Fisher's classification functions) or, broadly, the computation of Fisher's classification functions in multiclass settings.