Logistic Regression vs. LDA – Two-Class Classifiers Explained

Tags: classification, discriminant-analysis, logistic, regression

I am trying to wrap my head around the statistical difference between linear discriminant analysis and logistic regression. Is my understanding right that, for a two-class classification problem, LDA fits two normal density functions (one for each class), which create a linear boundary where they intersect, whereas logistic regression only models the log-odds function between the two classes, which creates a boundary but does not assume density functions for each class?
Related Solutions
I take it that the question is about LDA and linear (not logistic) regression.
There is a considerable and meaningful relation between linear regression and linear discriminant analysis. When the dependent variable (DV) consists of just 2 groups, the two analyses are actually identical. Although the computations are different and the results - the regression and discriminant coefficients - are not the same, they are exactly proportional to each other.
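A minimal sketch of that proportionality (the toy data and variable names are mine, not the answer's), using scikit-learn:

```python
import numpy as np
from numpy.linalg import det
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-group data: 200 observations, 4 predictors
X, y = make_classification(n_samples=200, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)

# Linear regression with the 0/1 group indicator as the predictand
reg = LinearRegression().fit(X, y)

# LDA: with 2 groups there is a single discriminant function
lda = LinearDiscriminantAnalysis().fit(X, y)

# The coefficient vectors are proportional: the elementwise ratios
# should all come out as (numerically) one and the same constant.
print(reg.coef_ / lda.coef_.ravel())

# R-square of the regression should equal 1 - Wilks' lambda, where
# lambda = det(W) / det(T) from the one-way MANOVA of X by group.
W = sum(np.cov(X[y == g].T) * ((y == g).sum() - 1) for g in (0, 1))
T = np.cov(X.T) * (len(y) - 1)
print(reg.score(X, y), 1 - det(W) / det(T))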
Now for the more-than-two-groups situation. First, let us state that LDA (its extraction stage, not its classification stage) is equivalent (gives linearly related results) to canonical correlation analysis if you turn the grouping DV into a set of dummy variables (with one redundant dummy dropped) and do the canonical analysis with the sets "IVs" and "dummies". The canonical variates obtained on the side of the "IVs" set are what LDA calls "discriminant functions" or "discriminants".
So how, then, is canonical analysis related to linear regression? Canonical analysis is in essence a MANOVA (in the sense of "multivariate multiple linear regression" or "multivariate general linear model") deepened into the latent structure of the relationships between the DVs and the IVs: the two sets of variables are decomposed, in their inter-relations, into latent "canonical variates".

Let us take the simplest example, Y vs X1 X2 X3. Maximizing the correlation between the two sides is linear regression (if you predict Y by the Xs) or - which is the same thing - MANOVA (if you predict the Xs by Y). The correlation is unidimensional (with magnitude $R^2 =$ Pillai's trace) because the lesser set, Y, consists of just one variable. Now let us take two sets: Y1 Y2 vs X1 X2 X3. The correlation being maximized here is 2-dimensional because the lesser set contains 2 variables. The first and stronger latent dimension of the correlation is called the 1st canonical correlation, and the remaining part, orthogonal to it, the 2nd canonical correlation. So, MANOVA (or linear regression) asks only what the partial roles (the coefficients) of the variables are in the whole 2-dimensional correlation of the sets, while canonical analysis goes below that to ask what the partial roles of the variables are in the 1st correlational dimension and in the 2nd.
Thus, canonical correlation analysis is multivariate linear regression deepened into the latent structure of the relationship between the DVs and the IVs. Discriminant analysis is a particular case of canonical correlation analysis (see exactly how). So, this is the answer about the relation of LDA to linear regression in the general case of more than two groups.
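A quick empirical check of the LDA-vs-canonical-analysis equivalence stated above (a sketch only; the dataset choice and dummy coding are mine), using scikit-learn's CCA:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cross_decomposition import CCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 3 groups -> 2 discriminant functions

# Turn the grouping DV into dummies, dropping one redundant column
dummies = np.eye(3)[y][:, :-1]

# Canonical analysis between the "IVs" set and the "dummies" set
cca = CCA(n_components=2).fit(X, dummies)
iv_variates, _ = cca.transform(X, dummies)

# LDA extraction stage: scores on the discriminant functions
discriminants = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Each canonical variate on the "IVs" side should correlate
# (near) perfectly, up to sign, with the matching discriminant.
for k in range(2):
    r = np.corrcoef(iv_variates[:, k], discriminants[:, k])[0, 1]
    print(f"dimension {k + 1}: |r| = {abs(r):.4f}")
```

Both $|r|$ values should come out close to 1, illustrating that the canonical variates on the "IVs" side and LDA's discriminants are linearly related.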
Note that my answer does not at all treat LDA as a classification technique; I was discussing LDA only as a technique for extraction of latents. Classification is the second, stand-alone stage of LDA (I described it here). @Michael Chernick was focusing on it in his answers.
This question is not restricted to LDA, but can be asked about any binary classifier that is used in a multi-class setting by making all pairwise comparisons. The question is how to combine all pairwise classifications into one final classification.
The simplest approach is as follows. Each of the $\frac{K(K-1)}{2}$ pairwise classifiers results in a "winning" class (among the two considered). Count the number of wins for each of the classes (with the upper bound $K-1$), and assign the observation to the class with the most wins. Note that this simple "voting" approach works even if your classifier does not return a probability of belonging to each of the two classes, but simply reports pairwise decisions.
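Here is a minimal sketch of that voting scheme (the helper name, the tie-breaking rule, and logistic regression as the base binary classifier are my choices, not the answer's):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def one_vs_one_vote(X_train, y_train, X_test):
    """Train K(K-1)/2 pairwise classifiers; each casts one vote per test
    observation; predict the class with the most votes."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        # Fit a binary classifier on just the two classes' training rows
        mask = np.isin(y_train, [classes[i], classes[j]])
        clf = LogisticRegression().fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        votes[:, i] += (pred == classes[i])
        votes[:, j] += (pred == classes[j])
    # np.argmax breaks ties in favour of the lower class index
    return classes[votes.argmax(axis=1)]
```

(For what it's worth, scikit-learn ships this scheme ready-made as sklearn.multiclass.OneVsOneClassifier.)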
When each of the pairwise classifiers reports not only the pairwise decision but also the probability of belonging to each of the two classes, more sophisticated algorithms become possible. I cannot give an overview or advice, but there is a massively popular 2004 paper (over 1k citations according to Google Scholar) that reviews exactly this question and offers some novel methods:
- Wu, T. F., Lin, C. J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. The Journal of Machine Learning Research, 5, 975-1005.
I would guess, however, that in many real situations the simple voting method would already give reasonable results.
Update: In the NIPS version of the same paper the authors report the performance of several methods, including the "voting" one, on several real datasets with the number of classes ranging from 6 to 26; see Table 1. The voting method seems to be very competitive in each case. On some datasets it even seems to outperform all the other, much more sophisticated, methods.
Best Answer
It sounds to me that you are correct. Logistic regression indeed does not assume any specific shape for the densities in the space of the predictor variables, but LDA does. Here are some differences between the two analyses, briefly.
Binary logistic regression (BLR) vs. linear discriminant analysis with 2 groups (also known as Fisher's LDA):

- BLR: Based on maximum likelihood estimation.
  LDA: Based on least squares estimation; equivalent to linear regression with a binary predictand (the coefficients are proportional, and R-square = 1 - Wilks' lambda).

- BLR: Estimates the probability of group membership immediately (the predictand is itself taken as the probability, the observed one) and conditionally.
  LDA: Estimates the probability mediately, i.e. indirectly (the predictand is viewed as a binned continuous variable, the discriminant), via a classificatory device (such as naive Bayes) which uses both conditional and marginal information.

- BLR: Not so exacting about the measurement level and the distributional form of the predictors.
  LDA: Predictors should desirably be interval-level with a multivariate normal distribution.

- BLR: No requirements on the within-group covariance matrices of the predictors.
  LDA: The within-group covariance matrices should be identical in the population.

- BLR: The groups may have quite different $n$.
  LDA: The groups should have similar $n$.

- BLR: Not so sensitive to outliers.
  LDA: Quite sensitive to outliers.

- BLR: The younger method.
  LDA: The older method.

- BLR: Usually preferred, because it is less exacting and more robust.
  LDA: With all its requirements met, it often classifies better than BLR (its asymptotic relative efficiency is then 3/2 times higher).
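A footnote on the question's framing: the reason LDA's two fitted Gaussians intersect in a linear (rather than curved) boundary is that, under the shared-covariance assumption, the log posterior odds is itself linear in $x$. A quick standard derivation (notation mine; $f_k$ is the $\mathcal{N}(\mu_k, \Sigma)$ density of class $k$ and $\pi_k$ its prior):

$$
\log\frac{P(y=1 \mid x)}{P(y=0 \mid x)}
= \log\frac{\pi_1 f_1(x)}{\pi_0 f_0(x)}
= \log\frac{\pi_1}{\pi_0}
- \tfrac{1}{2}(\mu_1+\mu_0)^{\top}\Sigma^{-1}(\mu_1-\mu_0)
+ x^{\top}\Sigma^{-1}(\mu_1-\mu_0).
$$

The quadratic terms $x^{\top}\Sigma^{-1}x$ cancel because $\Sigma$ is shared between the groups (with unequal covariances, as in QDA, they would not, and the boundary would be quadratic). Logistic regression posits exactly this linear form for the log-odds directly, without assuming the densities $f_k$ are Gaussian - which is the asymmetry the question describes.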