The question is quite old, and I'm surprised there has been no attempt to answer it, so I will try.
> I wonder why I have not seen an example yet where they actually calculate the variance-covariance matrices in each class and compare them with each other?
Well, theoretically you are completely right. But in practice, there are a few reasons why you see it so rarely:
- The LDA method is quite robust to mild violations of the normality and covariance-equality assumptions.
- In contrast, a more complicated model is prone to overfitting: using QDA where it's not necessary and LDA already works well may lead to an overfit.
- You may check the assumptions with a hypothesis test, but the test has its own assumptions, so it becomes a vicious circle. For example, take a look here:
> Box’s M Test is extremely sensitive to departures from normality; the fundamental test assumption is that your data is multivariate normally distributed. Therefore, if your samples don’t meet the assumption of normality, you shouldn’t use this test.
So how do you decide? My personal recommendation: think about the physical meaning of your data. What groups are you investigating? Is there a good reason to expect a significant difference between the groups' covariances?
Example 1
Let's look at the `iris` dataset.
Check Box's M test:
> biotools::boxM(iris[,-5], iris[,5])
Box's M-test for Homogeneity of Covariance Matrices
data: iris[, -5]
Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16
The null hypothesis of covariance equality is rejected, i.e. the assumption is violated.
But let's take a look at the data:
library(ggplot2)
x <- iris[,-5]
y <- iris[,5]
pca <- prcomp(x, center = T, scale. = F)
qplot(x=pca$x[,1], y=pca$x[,2], color=y, shape=y, xlab = 'PC 1', ylab='PC 2', size=4)
We can see visually that the variations are not the same in the three groups. But do they differ significantly? It doesn't seem so. Moreover, think about what data we are analyzing: petals of closely related flower species. So it's quite natural to assume that the within-group covariances are somewhat the same.
Compare LDA and QDA
library(klaR)
partimat(Species ~ ., data = iris, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
partimat(Species ~ ., data = iris, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
As one can see, the separation quality is almost the same for LDA and QDA, but the QDA model looks overfitted because of its unnecessarily complex boundaries.
Example 2
Let's simulate a more significant violation of the covariance assumption: multiplying the values of one group by 2 increases its variance by a factor of 4.
x2 <- iris[51:150,-5]
y2 <- factor(iris[51:150,5])
x2[y2 == 'versicolor',] <- 2*x2[y2 == 'versicolor',]
pca <- prcomp(x2, center = T, scale. = F)
qplot(x=pca$x[,1], y=pca$x[,2], color=y2, shape=y2, xlab = 'PC 1', ylab='PC 2', size=4)
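As a quick sanity check of the fourfold-variance claim, in base R:

```r
# Cov(2X) = 4 Cov(X): doubling every value quadruples the (co)variance
v <- iris[iris$Species == 'versicolor', -5]
all.equal(cov(2 * v), 4 * cov(v))   # TRUE
```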
Theoretically, we would reject LDA and use QDA. But the PCA projection shows that the two classes can still be separated linearly.
iris2 <- iris[51:150,]; iris2$Species <- factor(iris2$Species); iris2[iris2$Species == 'versicolor', -5] <- 2*iris2[iris2$Species == 'versicolor', -5]
partimat(Species ~ ., data = iris2, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
partimat(Species ~ ., data = iris2, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
We see that both models work well despite the assumption violation, but we would choose the linear model since it is more interpretable and less likely to overfit.
Example 3
Now suppose we want to discriminate only `versicolor`, i.e. our target now consists of two classes: `versicolor` and `not versicolor`. Now it's reasonable to assume that the covariances are different, because the covariance of the `not versicolor` group contains between-group variance as well. The normality assumption is significantly violated too.
y3 <- as.character(y); y3[y3 != 'versicolor'] <- 'not versicolor'; y3 <- factor(y3)
pca <- prcomp(x, center = T, scale. = F)  # recompute PCA on the full data (pca was overwritten in Example 2)
qplot(x=pca$x[,1], y=pca$x[,2], color=y3, shape=y, xlab = 'PC 1', ylab='PC 2', size=4)
Compare the variance of the red points and the blue points: the difference is quite significant now, right? Theoretically, we would reject both QDA and LDA, since neither normality nor covariance equality holds. But let's see how the methods actually perform:
iris3 <- iris; iris3$Species <- y3
partimat(Species ~ ., data = iris3, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
partimat(Species ~ ., data = iris3, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
> l <- lda(Species ~ ., data = iris3)
> table(y3, predict(l)$class)
y3               not versicolor versicolor
  not versicolor             86         14
  versicolor                 26         24
> q <- qda(Species ~ ., data = iris3)
> table(y3, predict(q)$class)
y3               not versicolor versicolor
  not versicolor             99          1
  versicolor                  4         46
Well, LDA is indeed not suitable here. The QDA model, however, is not that bad.
So in general, I would recommend:
- Think about your data and task. Is there a reason for a significant violation of the assumptions?
- Take a look at PCA projections. If the first 2-3 components explain a substantial part of the variance (say, more than 90%), then from the score plots you can assess normality of the distribution and equality of the within-group covariances, and estimate how well linear discrimination would work.
- Try both LDA and QDA. If the difference is not essential, stick with LDA as QDA most probably is overfitted.
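The PCA check in the second point above can be sketched like this (using `iris` for concreteness):

```r
pca <- prcomp(iris[, -5], center = TRUE, scale. = FALSE)

# Cumulative proportion of variance explained by the components
prop <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(prop)   # for iris, the first two components cover ~98%
```

If the cumulative proportion after two or three components is high, the score plots are a faithful low-dimensional picture of the group shapes.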
P.S.
> Or maybe you can use the pairs plot for this (ref below)? Then how do I read it?
I hope it's clear from my explanation that `pairs` is the wrong tool. `pairs` shows only feature-pairwise variance, but the assumption is about covariance within groups, i.e. for `iris` we assume that `cov(iris[iris$Species == 'setosa', -5])`, `cov(iris[iris$Species == 'versicolor', -5])`, and `cov(iris[iris$Species == 'virginica', -5])` are equal.
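If you do want the direct check the question asks about, it is a one-liner per group; here is a minimal base-R sketch:

```r
# Within-group covariance matrices, one per species
covs <- lapply(split(iris[, -5], iris$Species), cov)

# A crude comparison: largest element-wise difference between two groups
max(abs(covs$setosa - covs$virginica))
```

Inspecting the matrices themselves (or their element-wise differences) makes the violation visible without relying on a formal test.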
Best Answer
In a scenario with $N$ samples and $K$ classes or labels, the first formula should be
$$\frac{1}{N-K} \sum_{c=1}^K \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
and is for calculating the pooled variance, to be used if you're tying the covariance matrix across classes (as in LDA). The $N-K$ term arises from Bessel's correction.
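A sketch of this pooled estimate in base R, using `iris` for concreteness (note that `cov()` divides by $N_c - 1$, so here the scatter matrices are summed directly before dividing by $N - K$):

```r
X <- iris[, -5]; g <- iris$Species
N <- nrow(X); K <- nlevels(g)

# Sum the within-class scatter matrices, then divide by N - K
scatter <- Reduce(`+`, lapply(levels(g), function(cl) {
  Xc <- scale(X[g == cl, ], center = TRUE, scale = FALSE)  # x_i - mu_c
  t(Xc) %*% Xc
}))
pooled <- scatter / (N - K)
```

This is exactly the covariance matrix that LDA shares across all classes.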
If you're not tying the covariance matrices (as in QDA), then the covariance matrix for a class $c$ with $N_c$ samples is
$$\frac{1}{N_c - 1} \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
if you want an unbiased estimate of the variance, or
$$\frac{1}{N_c} \sum_{y_i = c} (x_i - \hat \mu_c) (x_i - \hat \mu_c)^\intercal$$
if you want a maximum-likelihood estimate of the variance.
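In R, `cov()` uses the $N_c - 1$ denominator; the $1/N_c$ version is a simple rescaling:

```r
# Per-class covariance for one class (setosa shown as an example)
Xc <- iris[iris$Species == 'setosa', -5]
Nc <- nrow(Xc)

S_unbiased <- cov(Xc)                       # divides by Nc - 1
S_mle      <- S_unbiased * (Nc - 1) / Nc    # divides by Nc
```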
Either way, usually you don't calculate the equation of the decision boundary in QDA. Given a test point you just evaluate the posterior probability of each class, and pick the highest.
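That evaluation can be sketched in a few lines of base R, assuming Gaussian class densities with per-class covariances, equal priors, and constant terms dropped:

```r
# Log of the (unnormalized) Gaussian discriminant for one class
log_disc <- function(x0, mu, S) {
  d <- x0 - mu
  -0.5 * log(det(S)) - 0.5 * sum(d * solve(S, d))  # -0.5 log|S| - 0.5 d' S^-1 d
}

# Per-class means and covariance matrices
mus <- lapply(split(iris[, -5], iris$Species), colMeans)
Ss  <- lapply(split(iris[, -5], iris$Species), cov)

x0 <- unlist(iris[1, -5])                 # classify the first observation
scores <- mapply(function(mu, S) log_disc(x0, mu, S), mus, Ss)
names(which.max(scores))                  # class with the highest score
```

With unequal priors you would add `log(prior_c)` to each score; the argmax rule stays the same.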