What is the relation between linear discriminant analysis (LDA) and Bayes' rule? I understand that LDA is used in classification by trying to minimize the ratio of within-group variance to between-group variance, but I don't know how Bayes' rule is used in it.
Linear Discriminant Analysis and Bayes Rule – Classification Techniques
Tags: bayes, classification, discriminant-analysis
Related Solutions
If you have two classes, with data distributed according to densities $f_1$ and $f_2$ respectively, then the Bayes rule that minimizes the expected classification error (with equal loss for each type of error) selects class 1 for an observation vector $x$ if $f_1(x)/f_2(x) > 1$ and selects class 2 otherwise. LDA becomes this Bayes rule under special conditions on the multivariate distributions $f_1$ and $f_2$: both must be multivariate normal with the same covariance matrix (and, for the rule to be useful, different mean vectors).
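To make this concrete, here is a minimal sketch of the density-ratio rule; the means, shared covariance matrix, and test point below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class parameters: equal covariance, different means
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

f1 = multivariate_normal(mu1, Sigma)
f2 = multivariate_normal(mu2, Sigma)

x = np.array([1.0, 0.4])

# Bayes rule with equal priors and equal misclassification losses:
# choose class 1 iff f1(x)/f2(x) > 1
ratio = f1.pdf(x) / f2.pdf(x)
print(ratio, "-> class", 1 if ratio > 1 else 2)
```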
The details of this can be found in standard texts on multivariate analysis and pattern classification.
I have previously given an explanation like this on the post you referenced. Is this clear to you now? Why was it not the first time?
I am adding to the answer because it is now clear that the OP is asking how to compute the LDA and not how it works in theory.
LDA creates a separating hyperplane, defined by a linear combination of the variables being equal to a constant: $f_1(x)/f_2(x) = 1$ defines the hyperplane. So you multiply the variables by their coefficients and sum; this sum is then compared to the constant to determine on which side of the hyperplane the vector $x$ lies (i.e., whether or not $f_1(x) > f_2(x)$).
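Under the equal-covariance normality condition the log of the density ratio is linear in $x$, so the hyperplane can be written out explicitly. A sketch with the same invented parameters as above, showing the coefficient vector and the constant it is compared against:

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Coefficients of the separating hyperplane w.x = c, obtained by
# setting log f1(x) - log f2(x) = 0 under equal covariance
w = Sigma_inv @ (mu1 - mu2)
c = 0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2)

x = np.array([1.0, 0.4])
side = w @ x - c          # > 0 means f1(x) > f2(x), i.e. class 1
print("class", 1 if side > 0 else 2)
```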
"Fisher's Discriminant Analysis" is simply LDA in a situation of 2 classes. When there is only 2 classes computations by hand are feasible and the analysis is directly related to Multiple Regression. LDA is the direct extension of Fisher's idea on situation of any number of classes and uses matrix algebra devices (such as eigendecomposition) to compute it. So, the term "Fisher's Discriminant Analysis" can be seen as obsolete today. "Linear Discriminant analysis" should be used instead. See also. Discriminant analysis with 2+ classes (multi-class) is canonical by its algorithm (extracts dicriminants as canonical variates); rare term "Canonical Discriminant Analysis" usually stands simply for (multiclass) LDA therefore (or for LDA + QDA, omnibusly).
Fisher used what were then called "Fisher classification functions" to classify objects after the discriminant function had been computed. Nowadays a more general Bayes' approach is used within the LDA procedure to classify objects.
For explanations of LDA I can point you to these answers of mine: extraction in LDA, classification in LDA, and LDA among related procedures, as well as several further questions and answers.
Just as ANOVA requires an assumption of equal variances, LDA requires an assumption of equal variance-covariance matrices (between the input variables) across the classes. This assumption is important for the classification stage of the analysis. If the matrices differ substantially, observations will tend to be assigned to the class where variability is greater. To overcome this problem QDA was invented: QDA is a modification of LDA that allows for such heterogeneity of the classes' covariance matrices.
If you have heterogeneity (as detected, for example, by Box's M test) and you don't have QDA at hand, you may still use LDA in the regime of using individual covariance matrices (rather than the pooled matrix) of the discriminants at the classification stage. This partly solves the problem, though less effectively than QDA, because, as just pointed out, these are the matrices between the discriminants and not between the original variables (whose matrices are what differed).
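As an illustration of the LDA/QDA contrast, here is a hedged sketch using scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis on made-up two-class data whose covariance matrices deliberately differ (all parameters are invented):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Hypothetical data: two classes with clearly different covariance matrices
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X2 = rng.multivariate_normal([2, 2], [[4.0, 1.5], [1.5, 4.0]], size=200)
X = np.vstack([X1, X2])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)      # pooled covariance
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # per-class covariances
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```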
Let me leave the analysis of your example data to you.
Reply to @zyxue's answer and comments
LDA is what you defined FDA to be in your answer. LDA first extracts linear constructs (called discriminants) that maximize the between-to-within separation, and then uses those to perform (Gaussian) classification. If, as you say, LDA were not tied to the task of extracting the discriminants, LDA would appear to be just a Gaussian classifier, and no name "LDA" would be needed at all.
It is at that classification stage that LDA assumes both normality and variance-covariance homogeneity of the classes. The extraction or "dimensionality reduction" stage of LDA assumes linearity and variance-covariance homogeneity; the two assumptions together make "linear separability" feasible. (We use the single pooled $S_w$ matrix to produce discriminants, which therefore have an identity pooled within-class covariance matrix; that gives us the right to apply the same set of discriminants for classifying into all the classes. If all the class $S_w$s are the same, the said within-class covariances are all the same identity matrix, and that right becomes absolute.)
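A minimal sketch of that extraction stage, on invented 3-class data: the discriminants come from the generalized eigenproblem $S_b v = \lambda S_w v$, and the pooled within-class covariance matrix of the resulting discriminants is (up to rounding) the identity:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
# Made-up 3-class data in p = 4 dimensions, nk = 50 points per class
means = [np.zeros(4), np.full(4, 1.5), np.array([3., 0., 1., 2.])]
nk, g = 50, 3
X = np.vstack([rng.multivariate_normal(m, np.eye(4), size=nk) for m in means])
y = np.repeat(np.arange(g), nk)
n, grand = len(X), X.mean(axis=0)

def centered_sscp(A):
    """Scatter (SSCP) matrix of A about its own mean."""
    C = A - A.mean(axis=0)
    return C.T @ C

Sw = sum(centered_sscp(X[y == k]) for k in range(g)) / (n - g)  # pooled within
Sb = sum(nk * np.outer(X[y == k].mean(0) - grand,
                       X[y == k].mean(0) - grand) for k in range(g)) / (g - 1)

# Discriminants solve the generalized eigenproblem Sb v = lambda Sw v;
# eigh normalizes the eigenvectors so that V' Sw V = I
evals, evecs = eigh(Sb, Sw)
V = evecs[:, ::-1][:, :g - 1]        # the g-1 leading discriminant directions
D = (X - grand) @ V                  # discriminant scores

# Pooled within-class covariance of the discriminants is ~ the identity
Dw = sum(centered_sscp(D[y == k]) for k in range(g)) / (n - g)
print(np.round(Dw, 6))
```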
The Gaussian classifier (the second stage of LDA) uses Bayes' rule to assign observations to classes by the discriminants. The same result can be accomplished via the so-called Fisher linear classification functions, which utilize the original features directly. However, the Bayes' approach based on the discriminants is a little more general in that it also allows using separate class covariance matrices of the discriminants, in addition to the default way of using the single pooled one. It also allows basing the classification on a subset of the discriminants.
When there are only two classes, both stages of LDA can be described together in a single pass, because "latent extraction" and "observation classification" then reduce to the same task.
Related Questions
- Discriminant Analysis – Comparing Bayesian and Fisher’s Approaches
- Dimensionality Reduction – Linear Discriminant Analysis and Non-Normally Distributed Data
- Solved – Conceptual understanding of linear discriminant analysis
- Solved – Sources’ seeming disagreement on linear, quadratic and Fisher’s discriminant analysis
- Solved – “Descriptive Discriminant Analysis”
Best Answer
Classification in LDA goes as follows (the Bayes' rule approach). [About the extraction of the discriminants one might look here.]
According to Bayes' theorem, the sought-for probability that we are dealing with class $k$ while currently observing point $x$ is $P(k|x) = P(k)*P(x|k) / P(x)$, where
$P(k)$ is the unconditional (background) probability of class $k$; $P(x)$ is the unconditional (background) probability of point $x$; and $P(x|k)$ is the probability of the presence of point $x$ in class $k$, i.e. if the class being dealt with is $k$.
"Observing currently point $x$" being the base condition, $P(x)=1$, and so the denominator can be omitted. Thus, $P(k|x) = P(k)*P(x|k)$.
$P(k)$ is the prior (pre-analytical) probability that the native class for $x$ is $k$; it is specified by the user, and usually by default all classes receive equal $P(k)$ = 1/number_of_classes. In order to compute $P(k|x)$, the posterior (post-analytical) probability that the native class for $x$ is $k$, one has to know $P(x|k)$.
$P(x|k)$, a probability per se, cannot be found, because the discriminants, the main product of LDA, are continuous, not discrete, variables. The quantity expressing $P(x|k)$ in this case, and proportional to it, is the probability density (the PDF). Hence we need to compute the PDF for point $x$ in class $k$, $PDF(x|k)$, in the $p$-dimensional normal distribution formed by the values of the $p$ discriminants. [See Wikipedia: Multivariate normal distribution.]
$$PDF(x|k) = \frac{e^{-d/2}}{(2\pi)^{p/2}\sqrt{|\mathbf S|}}$$
where $d$ is the squared Mahalanobis distance [see Wikipedia: Mahalanobis distance] in the discriminants' space from point $x$ to the class centroid, and $\mathbf S$ is the covariance matrix between the discriminants, observed within that class.
Compute $PDF(x|k)$ in this way for each of the classes. $P(k)*PDF(x|k)$ for point $x$ and class $k$ expresses the sought-for $P(k)*P(x|k)$ for us. But with the above caveat that the PDF is not a probability per se, only proportional to one, we should normalize $P(k)*PDF(x|k)$, dividing it by the sum of $P(k)*PDF(x|k)$ over all classes. For example, if there are 3 classes in all, $k$, $l$, $m$, then
$$P(k|x) = \frac{P(k)*PDF(x|k)}{P(k)*PDF(x|k)+P(l)*PDF(x|l)+P(m)*PDF(x|m)}$$
Point $x$ is assigned by LDA to the class for which $P(k|x)$ is the highest.
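A minimal sketch of this classification recipe, with invented centroids, priors, and test point. The identity covariance matrices correspond to the default pooled case discussed in the Note below; separate per-class matrices could be substituted:

```python
import numpy as np

def pdf(x, centroid, S):
    """PDF(x|k): p-dimensional normal density at point x for a class
    with the given centroid and covariance matrix S of the discriminants."""
    p = len(x)
    diff = x - centroid
    d = diff @ np.linalg.solve(S, diff)         # squared Mahalanobis distance
    return np.exp(-d / 2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(S)))

# Made-up setup: 3 classes, p = 2 discriminants, equal priors P(k)
centroids = [np.array([0., 0.]), np.array([3., 0.]), np.array([1.5, 2.5])]
covs = [np.eye(2)] * 3          # pooled (identity) matrix used for every class
priors = [1 / 3] * 3

x = np.array([1.2, 0.8])
scores = [P * pdf(x, c, S) for P, c, S in zip(priors, centroids, covs)]
posteriors = np.array(scores) / sum(scores)     # normalize over all classes
print(posteriors, "-> assigned class:", posteriors.argmax())
```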
Note. This was the general approach. Many LDA programs by default use the pooled within-class matrix $\mathbf S$ for all classes in the formula for the PDF above. If so, the formula simplifies greatly, because such an $\mathbf S$ in LDA is the identity matrix (see the bottom footnote here), hence $|\mathbf S|=1$ and $d$ turns into the squared Euclidean distance. (Reminder: the pooled within-class $\mathbf S$ we are talking about contains the covariances between the discriminants, not between the input variables; the latter matrix is usually designated $\mathbf S_w$.)
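Spelled out, with the pooled (identity) $\mathbf S$ the density above reduces to

$$PDF(x|k) = \frac{e^{-d/2}}{(2\pi)^{p/2}\sqrt{|\mathbf S|}} = \frac{e^{-\|x-\bar x_k\|^2/2}}{(2\pi)^{p/2}} \quad\text{when } \mathbf S=\mathbf I,$$

where $\bar x_k$ is the centroid of class $k$ in the discriminants' space.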
Addition. Before the above Bayes' rule approach to classification was introduced into LDA, Fisher, the LDA pioneer, proposed computing what are now called Fisher's linear classification functions to classify points in LDA. For point $x$, the function score of belonging to class $k$ is the linear combination $b_{k1}V_{1x}+b_{k2}V_{2x}+\dots+b_{kp}V_{px}+Const_k$, where $V_1, V_2, \dots, V_p$ are the predictor variables in the analysis.
Coefficient $b_{kv}=(n-g)\sum_{w=1}^p{s_{vw}\bar{V}_{kw}}$, where $g$ is the number of classes, $\bar{V}_{kw}$ is the mean of variable $V_w$ in class $k$, and $s_{vw}$ is the element of the inverse of the pooled within-class scatter matrix of the $p$ $V$-variables.
$Const_k=\log P(k)-\left(\sum_{v=1}^p{b_{kv}\bar{V}_{kv}}\right)/2$.
Point $x$ gets assigned to the class for which its score is the highest. Classification results obtained by this Fisher's method (which bypasses the extraction of discriminants with its eigendecomposition) are identical to those obtained by the Bayes' method only if the pooled within-class covariance matrix is used with the discriminant-based Bayes' method (see the "Note" above) and all the discriminants are used in the classification. The Bayes' method is more general because it also allows using separate within-class matrices.
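A hedged sketch of these classification functions on invented data, reading $s_{vw}$ as an element of the inverse pooled scatter matrix per the formula above; under the stated conditions the resulting assignments coincide with those of the Bayes' approach:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up 3-class data with a shared covariance matrix by construction
means = [np.array([0., 0.]), np.array([3., 0.]), np.array([1.5, 2.5])]
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=60) for m in means])
y = np.repeat([0, 1, 2], 60)
n, g = len(X), 3
priors = np.full(g, 1 / g)

# Pooled within-class scatter (SSCP) matrix of the p variables
scatter = sum((X[y == k] - X[y == k].mean(0)).T @ (X[y == k] - X[y == k].mean(0))
              for k in range(g))
S_inv = (n - g) * np.linalg.inv(scatter)        # (n-g) * inverse scatter

# Fisher's classification functions:
# b_k = S_inv @ mean_k,  Const_k = log P(k) - (b_k . mean_k) / 2
B = np.array([S_inv @ X[y == k].mean(0) for k in range(g)])
consts = np.array([np.log(priors[k]) - B[k] @ X[y == k].mean(0) / 2
                   for k in range(g)])

x = np.array([1.2, 0.8])
scores = B @ x + consts
print("assigned class:", scores.argmax())
```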