Linear Discriminant Analysis and Bayes Rule – Classification Techniques

Tags: bayes, classification, discriminant-analysis

What is the relation between linear discriminant analysis and Bayes' rule? I understand that LDA is used in classification by trying to minimize the ratio of within-group to between-group variance, but I don't know how Bayes' rule is used in it.

Best Answer

Classification in LDA goes as follows (the Bayes' rule approach). [About the extraction of discriminants, one might look here.]

According to Bayes' theorem, the sought-for probability that we are dealing with class $k$ when currently observing point $x$ is $P(k|x) = P(k)*P(x|k) / P(x)$, where

- $P(k)$ – the unconditional (background) probability of class $k$;
- $P(x)$ – the unconditional (background) probability of point $x$;
- $P(x|k)$ – the probability of observing point $x$ within class $k$, i.e. given that the class being dealt with is $k$.

"Observing currently point $x$" being the base condition, $P(x)=1$, and so the denominator can be omitted. Thus, $P(k|x) = P(k)*P(x|k)$.

$P(k)$ is the prior (pre-analytical) probability that the native class of $x$ is $k$; it is specified by the user. Usually, by default, all classes receive equal $P(k)$ = 1/(number of classes). In order to compute $P(k|x)$, i.e. the posterior (post-analytical) probability that the native class of $x$ is $k$, one should know $P(x|k)$.

$P(x|k)$ – a probability per se – cannot be found directly, because the discriminants, the main output of LDA, are continuous, not discrete, variables. The quantity that expresses $P(x|k)$ in this case, and is proportional to it, is the probability density (the PDF). We therefore need to compute the PDF for point $x$ in class $k$, $PDF(x|k)$, under the $p$-dimensional normal distribution formed by the values of the $p$ discriminants. [See Wikipedia: Multivariate normal distribution.]

$$PDF(x|k) = \frac {e^{-d/2}} {(2\pi)^{p/2}\sqrt{|\mathbf S|}}$$

where $d$ is the squared Mahalanobis distance [see Wikipedia: Mahalanobis distance] in the discriminants' space from point $x$ to the class centroid, and $\mathbf S$ is the covariance matrix of the discriminants observed within that class.
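
For concreteness, here is a minimal sketch of this density formula in Python/NumPy. The names `class_density`, `x`, `centroid`, and `S` are illustrative placeholders, not part of any particular LDA software:

```python
import numpy as np

def class_density(x, centroid, S):
    """PDF(x|k) of a p-dimensional normal with mean `centroid` and covariance `S`."""
    p = len(x)
    diff = x - centroid
    # squared Mahalanobis distance d = (x - centroid)' S^{-1} (x - centroid)
    d = diff @ np.linalg.solve(S, diff)
    return np.exp(-d / 2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(S)))

# Example: a point, a class centroid, and a within-class covariance in 2-D discriminant space
x = np.array([0.5, -1.0])
centroid = np.array([0.0, 0.0])
S = np.array([[1.0, 0.2],
              [0.2, 1.5]])
print(class_density(x, centroid, S))
```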

Compute $PDF(x|k)$ in this way for each of the classes. $P(k)*PDF(x|k)$ for point $x$ and class $k$ expresses the sought-for $P(k)*P(x|k)$ for us. But with the above reservation, that a PDF value is not a probability per se, only proportional to one, we should normalize $P(k)*PDF(x|k)$ by dividing it by the sum of $P(k)*PDF(x|k)$ over all the classes. For example, if there are 3 classes in all, $k$, $l$, $m$, then

$$P(k|x) = \frac{P(k)*PDF(x|k)}{P(k)*PDF(x|k)+P(l)*PDF(x|l)+P(m)*PDF(x|m)}$$

Point $x$ is assigned by LDA to the class for which $P(k|x)$ is the highest.
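
Putting the pieces together, here is a hedged sketch of the whole classification step in plain NumPy. The inputs `centroids`, `covs`, and `priors` are hypothetical: they stand for the class centroids, within-class covariance matrices, and priors you would take from your own LDA fit, not from any particular package:

```python
import numpy as np

def classify(x, centroids, covs, priors):
    """Return (posterior probabilities, index of the winning class)."""
    scores = []
    for centroid, S, prior in zip(centroids, covs, priors):
        p = len(x)
        diff = x - centroid
        d = diff @ np.linalg.solve(S, diff)        # squared Mahalanobis distance
        pdf = np.exp(-d / 2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(S)))
        scores.append(prior * pdf)                 # P(k) * PDF(x|k)
    scores = np.array(scores)
    posteriors = scores / scores.sum()             # normalize over all classes
    return posteriors, int(np.argmax(posteriors))

# Three classes with equal priors, as in the example above
centroids = [np.array([0., 0.]), np.array([2., 1.]), np.array([-1., 3.])]
covs = [np.eye(2)] * 3
priors = [1/3, 1/3, 1/3]
posteriors, winner = classify(np.array([0.5, -1.0]), centroids, covs, priors)
print(posteriors, winner)
```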

Note. This was the general approach. Many LDA programs by default use the pooled within-class matrix $\bf S$ for all classes in the PDF formula above. If so, the formula simplifies greatly, because in LDA such an $\bf S$ is the identity matrix (see the bottom footnote here), and hence $|\mathbf S|=1$ and $d$ turns into the squared Euclidean distance (reminder: the pooled within-class $\bf S$ we are talking about is the covariance matrix of the discriminants, not of the input variables; the latter matrix is usually designated $\bf S_w$).
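
As a rough sketch of that simplification (assuming, as the note says, that the pooled within-class $\bf S$ of the discriminants is the identity; names are again illustrative):

```python
import numpy as np

def class_density_identity_S(x, centroid):
    # With S = I the determinant is 1 and the Mahalanobis distance becomes Euclidean,
    # so the density reduces to exp(-d/2) over the constant (2*pi)^(p/2).
    p = len(x)
    d = np.sum((x - centroid) ** 2)   # squared Euclidean distance to the class centroid
    return np.exp(-d / 2) / (2 * np.pi) ** (p / 2)
```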

Addition. Before the above Bayes' rule approach to classification was introduced into LDA, Fisher, the pioneer of LDA, proposed computing what are now called Fisher's linear classification functions to classify points in LDA. For point $x$, the function score of belonging to class $k$ is the linear combination $b_{kv_1}V1_x+b_{kv_2}V2_x+\dots+b_{kv_p}Vp_x+Const_k$, where $V1, V2, \dots, Vp$ are the predictor variables in the analysis.

Coefficient $b_{kv}=(n-g)\sum_{w=1}^{p}{s_{vw}\bar{V}_{kw}}$, where $g$ is the number of classes and $s_{vw}$ is an element of the inverse of the pooled within-class scatter matrix of the $p$ $V$-variables.

$Const_k=\log(P(k))-\frac{1}{2}\sum_{v=1}^{p}{b_{kv}\bar{V}_{kv}}$.

Point $x$ gets assigned to the class for which its score is the highest. The classification results obtained by this Fisher's method (which bypasses the extraction of discriminants and the eigendecomposition it involves) are identical to those obtained by the Bayes' rule method only if the pooled within-class covariance matrix is used with the Bayes' method based on discriminants (see "Note" above) and all the discriminants are used in the classification. The Bayes' method is more general because it also allows using separate within-class matrices.
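
For illustration, here is a sketch of Fisher's classification functions computed directly from a made-up data set with plain NumPy. The data, variable names, and seed are all assumptions for the example; the formulas are the ones given above ($b_k$ from the inverse pooled within-class scatter matrix, $Const_k$ from the prior and the class means):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),          # class 0 observations (V1, V2)
               rng.normal(2, 1, (20, 2))])         # class 1 observations
y = np.array([0] * 20 + [1] * 20)

classes = np.unique(y)
n, p = X.shape
g = len(classes)
priors = np.array([np.mean(y == k) for k in classes])

# Pooled within-class scatter matrix of the V-variables
Sw = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
         for k in classes)
Sw_inv = np.linalg.inv(Sw)

# b_k = (n - g) * Sw^{-1} mean_k ;  Const_k = log P(k) - (b_k' mean_k) / 2
B = np.array([(n - g) * Sw_inv @ X[y == k].mean(axis=0) for k in classes])
consts = np.array([np.log(priors[i]) - B[i] @ X[y == classes[i]].mean(axis=0) / 2
                   for i in range(g)])

x = np.array([1.0, 1.0])
scores = B @ x + consts                             # one classification score per class
print("assigned class:", classes[int(np.argmax(scores))])
```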