Could someone explain to me the difference between within-group covariance matrix and between-group covariance matrix in the context of linear discriminant analysis?
Solved – Difference between within-group and between-group covariance matrices in linear discriminant analysis
covariance-matrix, definition, discriminant-analysis
Related Solutions
Classification in LDA proceeds as follows (Bayes' rule approach). [For the extraction of discriminants one might look here.]
According to Bayes' theorem, the sought-for probability that we are dealing with class $k$ given that we currently observe point $x$ is $P(k|x) = P(k)*P(x|k) / P(x)$, where
$P(k)$ – unconditional (background) probability of class $k$; $P(x)$ – unconditional (background) probability of point $x$; $P(x|k)$ – probability of encountering point $x$ if the class being dealt with is $k$.
"Observing currently point $x$" being the base condition, $P(x)=1$, and so the denominator can be omitted. Thus, $P(k|x) = P(k)*P(x|k)$.
$P(k)$ is the prior (pre-analytical) probability that the native class for $x$ is $k$; $P(k)$ is specified by the user. Usually, by default, all classes receive equal $P(k)$ = 1/number_of_classes. In order to compute $P(k|x)$, i.e. the posterior (post-analytical) probability that the native class for $x$ is $k$, one should know $P(x|k)$.
$P(x|k)$ – a probability per se – cannot be computed directly, because the discriminants, the main output of LDA, are continuous rather than discrete variables. The quantity that expresses $P(x|k)$ in this case, and is proportional to it, is the probability density (PDF). Hence we need to compute the PDF for point $x$ in class $k$, $PDF(x|k)$, under the $p$-dimensional normal distribution formed by the values of the $p$ discriminants. [See Wikipedia: Multivariate normal distribution]
$$PDF(x|k) = \frac{e^{-d/2}}{(2\pi)^{p/2}\sqrt{|\mathbf S|}}$$
where $d = (x-\bar{x}_k)^T \mathbf S^{-1} (x-\bar{x}_k)$ is the squared Mahalanobis distance [see Wikipedia: Mahalanobis distance] in the discriminants' space from point $x$ to the class centroid $\bar{x}_k$, and $\mathbf S$ is the covariance matrix of the discriminants observed within that class.
Compute $PDF(x|k)$ this way for each of the classes. $P(k)*PDF(x|k)$ for point $x$ and class $k$ expresses the sought-for $P(k)*P(x|k)$. But with the reservation above that the PDF is not a probability per se, only proportional to one, we should normalize $P(k)*PDF(x|k)$ by dividing it by the sum of $P(k)*PDF(x|k)$ over all classes. For example, if there are 3 classes in all, $k$, $l$, $m$, then
$P(k|x) = P(k)*PDF(x|k) / [P(k)*PDF(x|k)+P(l)*PDF(x|l)+P(m)*PDF(x|m)]$
Point $x$ is assigned by LDA to the class for which $P(k|x)$ is the highest.
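For concreteness, here is a minimal Python sketch of this procedure (mine, not part of the original answer). The inputs are hypothetical: `class_data` is a list of per-class arrays of discriminant scores, `priors` holds the user-specified $P(k)$, and separate within-class covariance matrices are used, as in the general approach.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_bayes(x, class_data, priors):
    """Return the posterior P(k|x) for each class and the index of the winning class."""
    unnormalized = []
    for X_k, p_k in zip(class_data, priors):
        mu_k = X_k.mean(axis=0)                # class centroid in discriminant space
        S_k = np.cov(X_k, rowvar=False)        # within-class covariance of the discriminants
        pdf = multivariate_normal.pdf(x, mean=mu_k, cov=S_k)   # PDF(x|k)
        unnormalized.append(p_k * pdf)         # P(k) * PDF(x|k)
    unnormalized = np.array(unnormalized)
    posteriors = unnormalized / unnormalized.sum()   # normalize over all classes
    return posteriors, int(np.argmax(posteriors))    # assign x to the largest posterior
```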
Note. This was the general approach. Many LDA programs by default use the pooled within-class matrix $\bf S$ for all classes in the PDF formula above. If so, the formula simplifies greatly, because in LDA such $\bf S$ is an identity matrix (see the bottom footnote here), and hence $\bf |S|=1$ and $d$ turns into the squared Euclidean distance (reminder: the pooled within-class $\bf S$ we are talking about contains the covariances between the discriminants, not between the input variables, whose matrix is usually designated $\bf S_w$).
Addition. Before the above Bayes' rule approach to classification was introduced to LDA, Fisher, the pioneer of LDA, proposed computing the now so-called Fisher's linear classification functions to classify points in LDA. For point $x$ the function score of belonging to class $k$ is the linear combination $b_{kv1}V1_x+b_{kv2}V2_x+\dots+Const_k$, where $V1, V2, \dots, Vp$ are the predictor variables in the analysis.
Coefficient $b_{kv}=(n-g)\sum_w^p{s_{vw}\bar{V}_{kw}}$, $g$ being the number of classes and $s_{vw}$ being an element of the inverse of the pooled within-class scatter matrix of the $p$ $V$-variables.
$Const_k=\log(P(k))-(\sum_v^p{b_{kv}\bar{V}_{kv}})/2$.
Point $x$ gets assigned to the class for which its score is the highest. Classification results obtained by this Fisher's method (which bypasses the extraction of discriminants and the eigendecomposition involved in it) are identical to those obtained by Bayes' method only if the pooled within-class covariance matrix is used with the Bayes' method based on discriminants (see the "Note" above) and all the discriminants are used in the classification. Bayes' method is more general because it also allows using separate within-class matrices.
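A rough sketch of these classification functions (my code, assuming the equivalent matrix form $\mathbf b_k = \mathbf S_{pooled}^{-1}\bar{\mathbf V}_k$ and $Const_k = \log P(k) - \mathbf b_k^T\bar{\mathbf V}_k/2$). The inputs `class_data` (per-class arrays of the input $V$-variables) and `priors` are hypothetical, as above.

```python
import numpy as np

def fisher_classification_functions(class_data, priors):
    """Return the coefficient vectors b_k and constants Const_k for every class."""
    n = sum(len(X_k) for X_k in class_data)
    g = len(class_data)
    # pooled within-class covariance matrix of the input variables
    S_pooled = sum((len(X_k) - 1) * np.cov(X_k, rowvar=False) for X_k in class_data) / (n - g)
    S_inv = np.linalg.inv(S_pooled)
    coefs, consts = [], []
    for X_k, p_k in zip(class_data, priors):
        mean_k = X_k.mean(axis=0)
        b_k = S_inv @ mean_k                              # coefficients b_kv
        coefs.append(b_k)
        consts.append(np.log(p_k) - 0.5 * b_k @ mean_k)   # Const_k
    return np.array(coefs), np.array(consts)

def classify_fisher(x, coefs, consts):
    scores = coefs @ x + consts     # one linear score per class
    return int(np.argmax(scores))   # the highest score wins
```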
I will provide only a short informal answer and refer you to section 4.3 of The Elements of Statistical Learning for the details.
Update: "The Elements" happen to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, and in particular 4.3.2-4.3.3.
(2) Do the two approaches relate to each other, and if so, how?
They certainly do. What you call the "Bayesian" approach is more general and only assumes Gaussian distributions for each class. Your likelihood function is essentially based on the Mahalanobis distance from $x$ to the centre of each class.
You are of course right that for each class it is a linear function of $x$. However, note that the ratio of the likelihoods of two different classes (which you are going to use in order to perform the actual classification, i.e. choose between classes) is not going to be linear in $x$ if the classes have different covariance matrices. In fact, if one works out the boundaries between classes, they turn out to be quadratic, which is why this case is also called quadratic discriminant analysis, QDA.
An important insight is that the equations simplify considerably if one assumes that all classes have an identical covariance matrix [Update: if you assumed this all along, that might have been part of the misunderstanding]. In that case the decision boundaries become linear, and that is why this procedure is called linear discriminant analysis, LDA.
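To spell out why (a step only implied above), compare two classes $k$ and $l$ through the log ratio of their posteriors. Up to the normalizing constant common to all classes,

$$\log\frac{P(k|x)}{P(l|x)} = \log\frac{P(k)}{P(l)} - \frac12\log\frac{|\boldsymbol\Sigma_k|}{|\boldsymbol\Sigma_l|} - \frac12\Big[(x-\mu_k)^T\boldsymbol\Sigma_k^{-1}(x-\mu_k) - (x-\mu_l)^T\boldsymbol\Sigma_l^{-1}(x-\mu_l)\Big].$$

If $\boldsymbol\Sigma_k \neq \boldsymbol\Sigma_l$, the quadratic terms $x^T\boldsymbol\Sigma_k^{-1}x$ and $x^T\boldsymbol\Sigma_l^{-1}x$ do not cancel, so the decision boundary is quadratic (QDA); with a shared $\boldsymbol\Sigma$ they cancel and the ratio, hence the boundary, is linear in $x$ (LDA).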
It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.
(1) Can we do dimension reduction using the Bayesian approach?
If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.
However, if you are happy to assume the shared covariance matrix, then yes, certainly, because the "Bayesian approach" is then simply equivalent to LDA. However, if you check Hastie 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote (I don't even understand what that should mean: these projections depend on $k$, whereas what is usually meant by a projection is a way to project all points from all classes onto the same lower-dimensional manifold), but by the leading [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$.
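As an illustration (my sketch, not part of the answer), those projection directions can be computed as the leading eigenvectors of $\boldsymbol\Sigma^{-1}\mathbf M$. Here `class_data` is again a hypothetical list of per-class data arrays, and the centroid covariance $\mathbf M$ is computed without class-size weighting for simplicity.

```python
import numpy as np

def lda_directions(class_data, n_components):
    """Leading eigenvectors of Sigma^{-1} M: Sigma = pooled within-class covariance,
    M = covariance matrix of the class centroids."""
    n = sum(len(X_k) for X_k in class_data)
    g = len(class_data)
    Sigma = sum((len(X_k) - 1) * np.cov(X_k, rowvar=False) for X_k in class_data) / (n - g)
    means = np.array([X_k.mean(axis=0) for X_k in class_data])
    M = np.cov(means, rowvar=False)                              # unweighted, for simplicity
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sigma) @ M)   # generally non-symmetric
    order = np.argsort(eigvals.real)[::-1]                       # sort by decreasing eigenvalue
    return eigvecs.real[:, order[:n_components]]                 # columns are the projection directions
```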
Best Answer
The within-group covariance matrix is the average of the covariance matrices of the individual groups, weighted by the groups' weights (e.g. their sizes). The between-group covariance matrix is the covariance matrix of the group means (centroids), again weighted by the groups' weights.
What LDA aims to achieve is minimal variance within groups and maximal variance between groups: it seeks directions that maximize the ratio of between-group to within-group variance.
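A small numerical sketch of these two matrices (mine, assuming `groups` is a hypothetical list of $n_k \times p$ data arrays and the groups' weights are proportional to their sizes):

```python
import numpy as np

def within_and_between(groups):
    n = sum(len(G) for G in groups)
    weights = np.array([len(G) / n for G in groups])
    # within-group: weighted average of the per-group covariance matrices
    S_within = sum(w * np.cov(G, rowvar=False) for w, G in zip(weights, groups))
    # between-group: weighted covariance of the group means (centroids)
    means = np.array([G.mean(axis=0) for G in groups])
    grand_mean = weights @ means
    dev = means - grand_mean
    S_between = (weights[:, None] * dev).T @ dev
    return S_within, S_between
```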