Solved – Why do we not look at the covariance matrix when choosing between LDA and QDA

data mining, discriminant analysis

I understand the difference between LDA and QDA (linear and quadratic discriminant analysis): with LDA you assume that your features have the same covariance matrix in each class.

I wonder why I have not yet seen an example where they actually calculate the variance-covariance matrices in each class and compare them with each other?

Or maybe you can use the pairs plot for this (see below)? If so, how do I read it?

Or shouldn't I care too much and just choose the model based on the prediction error?

Pairs plot of the data

Thank you!

Best Answer

The question is quite old, and I'm surprised that there has been no attempt to answer it. So, I will try.

I wonder why I have not yet seen an example where they actually calculate the variance-covariance matrices in each class and compare them with each other?

Well, theoretically you are completely right. But in practice, there are a few reasons why you see it very rarely:

  • The LDA method is quite robust to mild violations of the assumptions of distribution normality and covariance-matrix equality.
  • In contrast, a more complicated model is more prone to overfitting, i.e. using QDA where it is not necessary and LDA works well may lead to overfitting.
  • You may check the assumptions with a hypothesis test, but... the test has its own assumptions, so it becomes a vicious circle. For example, take a look here:

Box’s M Test is extremely sensitive to departures from normality; the fundamental test assumption is that your data is multivariate normally distributed. Therefore, if your samples don’t meet the assumption of normality, you shouldn’t use this test.
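In other words, to trust Box's M you would first have to check normality. A quick sketch of such a check with base R's shapiro.test, applied to each feature within each iris species (univariate tests are only a crude proxy for the multivariate assumption):

    # Shapiro-Wilk p-values per feature (rows) and species (columns);
    # small values hint at non-normality within that group
    sapply(split(iris[, -5], iris$Species),
           function(d) apply(d, 2, function(v) shapiro.test(v)$p.value))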

So, how to know? Well, my personal recommendation: think about the physical meaning of your data. What are the groups you are investigating? Is there a high possibility (or some reason) of a significant difference in groups' covariance?

Example 1. Let's look at the iris dataset.

  1. Check Box's M test

    > biotools::boxM(iris[,-5], iris[,5])
    
       Box's M-test for Homogeneity of Covariance Matrices
    
    data:  iris[, -5]
    Chi-Sq (approx.) = 140.94, df = 20, p-value < 2.2e-16
    

The null hypothesis of covariance equality is rejected, i.e. the assumption is violated.
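At this point we can also do exactly what the question asks and inspect the group covariance matrices themselves rather than only the test statistic. A quick sketch:

    # within-group variance-covariance matrices of the iris features
    covs <- lapply(split(iris[, -5], iris$Species), cov)
    covs$setosa
    covs$virginica
    # one-number summary per group: the generalized variance
    sapply(covs, det)

Comparing the entries (or the determinants) gives a direct feel for how different the groups are, but it still does not tell you whether the differences matter for classification.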

  2. But let's take a look at the data:

    library(ggplot2)
    x <- iris[,-5]   # the four numeric features
    y <- iris[,5]    # the species labels
    # project the data onto its first two principal components
    pca <- prcomp(x, center = T, scale. = F)
    qplot(x=pca$x[,1], y=pca$x[,2], color=y, shape=y, xlab = 'PC 1', ylab='PC 2', size=4)
    

Principal Component Plot of iris data

We can see visually that the variation is not exactly the same in the three groups. But does it differ dramatically? It does not seem so. Moreover, let's think about what data we are analyzing: petals of closely related flower species. So it is quite natural to assume that the within-group covariances are somewhat the same.

  3. Compare LDA and QDA

    library(klaR)
    # decision regions for every pair of features; misclassified points are drawn in red
    partimat(Species ~ ., data = iris, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
    partimat(Species ~ ., data = iris, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
    

    LDA borders QDA borders

As one can see, the separation quality is almost the same for LDA and QDA. But the QDA model looks overfitted because of its unnecessarily complex borders.
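The impression from the plots can be backed with numbers. A minimal sketch using MASS (which klaR loads); note these are apparent error rates on the training data, so treat them only as a rough comparison:

    library(MASS)
    fit_lda <- lda(Species ~ ., data = iris)
    fit_qda <- qda(Species ~ ., data = iris)
    # proportion of misclassified training observations for each model
    mean(predict(fit_lda)$class != iris$Species)
    mean(predict(fit_qda)$class != iris$Species)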

Example 2. Let's simulate a more serious violation of the covariance assumption. Multiplying the values of one group by 2 increases its variance fourfold, since Cov(2X) = 4·Cov(X).

    # keep only versicolor and virginica, and double the versicolor values
    x2 <- iris[51:150,-5]
    y2 <- factor(iris[51:150,5])   # drop the unused 'setosa' level
    x2[y2 == 'versicolor',] <- 2*x2[y2 == 'versicolor',]
    pca <- prcomp(x2, center = T, scale. = F)
    qplot(x=pca$x[,1], y=pca$x[,2], color=y2, shape=y2, xlab = 'PC 1', ylab='PC 2', size=4)

PCA

Theoretically, we would reject LDA and use QDA. But the projection onto the principal components shows that the two classes can still be separated linearly.
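For completeness, Box's M test on this simulated data confirms the deliberate violation (a quick sketch; as expected, covariance equality is clearly rejected):

    # Box's M test on the two-group simulated data; the violation is
    # built in, since one group's covariance was inflated by a factor of 4
    biotools::boxM(x2, y2)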

    # the same simulation as a data frame for partimat: again double the versicolor rows
    iris2 <- iris[51:150,]
    iris2$Species <- factor(iris2$Species)
    iris2[iris2$Species == 'versicolor', -5] <- 2*iris2[iris2$Species == 'versicolor', -5]
    partimat(Species ~ ., data = iris2, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
    partimat(Species ~ ., data = iris2, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)

LDA QDA

We see that both models work well despite the assumption violation. But we would choose the linear model, since it is more interpretable and less likely to be overfitted.

Example 3. Now suppose we want to discriminate only versicolor, i.e. our target now consists of two classes: versicolor and not versicolor. Now it is reasonable to assume that the covariances differ, because the covariance of the 'not versicolor' group contains between-group variance as well. The normality assumption is violated severely, too.
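This claim is easy to check numerically before plotting. A quick sketch, reusing x and y from Example 1 (the pooled 'not versicolor' covariance also absorbs the spread between the setosa and virginica means):

    # within-group covariance of versicolor vs the pooled remaining species
    cov(x[y == 'versicolor', ])
    cov(x[y != 'versicolor', ])

Now build the two-class target and look at the PCA projection: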

    # two-class target: versicolor vs everything else
    y3 <- as.character(y); y3[y3 != 'versicolor'] <- 'not versicolor'; y3 <- factor(y3)
    pca <- prcomp(x, center = T, scale. = F)   # PCA of the full iris data again (Example 2 overwrote pca)
    qplot(x=pca$x[,1], y=pca$x[,2], color=y3, shape=y, xlab = 'PC 1', ylab='PC 2', size=4)

PCA colored by versicolor

Compare the variance of the red points with that of the blue points. The difference is quite significant now, right? Theoretically, we would refuse both LDA and QDA, since there is neither normality nor equal covariance. But let's see how the methods actually work:

    # replace the three-species label with the two-class target
    iris3 <- iris
    iris3$Species <- y3
    partimat(Species ~ ., data = iris3, method = "lda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)
    partimat(Species ~ ., data = iris3, method = "qda", plot.matrix = TRUE, imageplot = T, col.correct='green', col.wrong='red', cex=1)

LDA for versicolor discrimination QDA for versicolor discrimination

    > l <- lda(Species ~ ., data = iris3)
    > table(y3, predict(l)$class)

    y3               not versicolor versicolor
    not versicolor             86         14
    versicolor                 26         24

    > q <- qda(Species ~ ., data = iris3)
    > table(y3, predict(q)$class)

    y3               not versicolor versicolor
    not versicolor             99          1
    versicolor                  4         46

Well, LDA is indeed not suitable here: it misclassifies 40 of the 150 flowers (about 73% accuracy). The QDA model, however, is not that bad, with only 5 misclassifications (about 97% accuracy).

So in general, I would recommend:

  1. Think about your data and task. Is there a reason for a significant violation of the assumptions?
  2. Take a look at PCA projections. If the first 2-3 components explain a substantial part of the variance (say, more than 90%), the score plots let you judge the normality of the distribution and the equality of the within-group covariances, and estimate how well a linear discrimination would work.
  3. Try both LDA and QDA. If the difference is not substantial, stick with LDA, since QDA is most probably overfitted. (One way to compare them by prediction error is sketched below.)
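For point 3, one convenient way to compare the prediction error (as the question itself suggests) is leave-one-out cross-validation, which MASS::lda and MASS::qda provide via CV = TRUE. A minimal sketch on the original iris data:

    library(MASS)
    # leave-one-out cross-validated class predictions
    cv_lda <- lda(Species ~ ., data = iris, CV = TRUE)
    cv_qda <- qda(Species ~ ., data = iris, CV = TRUE)
    # estimated prediction error of each model
    mean(cv_lda$class != iris$Species)
    mean(cv_qda$class != iris$Species)

If QDA does not clearly beat LDA here, its extra flexibility is not buying anything and the simpler model is the safer choice.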

P.S.

Or maybe you can use the pairs plot for this (see below)? If so, how do I read it?

I hope it is clear from my explanation that pairs is the wrong tool: pairs shows only the pairwise scatter of the features over all observations, but the assumption is about covariance within groups, i.e. for iris we assume that cov(iris[iris$Species == 'setosa', -5]), cov(iris[iris$Species == 'versicolor', -5]), and cov(iris[iris$Species == 'virginica', -5]) are equal.