Solved – Why linear discriminant analysis is sensitive to cross validation (LDA overfitting problem)

classification, cross-validation, discriminant analysis, feature selection, machine learning

I have a set of 500+ observations (feature vectors of 200+ dimensions) from 7 classes and want to improve my classification rate (with SVM or KNN).

To reduce the dimensionality of the feature matrix (because of the curse of dimensionality), I'm using LDA. It maps my high-dimensional data down to 6 dimensions (one fewer than the number of classes). But applying LDA within cross validation doesn't help; it degrades the results dramatically.

Even when I use leave-one-out cross validation (LOOCV) to calculate the LDA projection matrix, it is computed with just one observation held out. My question is: why, even in this case, is the projection matrix ($W$) so over-fitted and sensitive to cross validation? Intuitively I have held out just one sample, yet the projection matrix cannot map the held-out observation correctly.
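
To make the setup concrete, here is a minimal sketch (not my actual code) of the two ways the projection can interact with cross validation, assuming scikit-learn and synthetic data of the stated shape:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the data described above: ~500 cases,
# 200 features, 7 classes.
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           n_classes=7, random_state=0)

# Leaky setup: LDA is fit on ALL data, so every held-out sample has already
# influenced the projection matrix W -- scores look optimistically good.
lda_all = LinearDiscriminantAnalysis(n_components=6).fit(X, y)
leaky = cross_val_score(SVC(), lda_all.transform(X), y, cv=10)

# Honest setup: LDA is refit inside each fold on the training part only;
# each held-out sample is projected by a W it never influenced.
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=6), SVC())
honest = cross_val_score(pipe, X, y, cv=10)

print(f"CV accuracy with leaked LDA: {leaky.mean():.3f}")
print(f"CV accuracy with honest LDA: {honest.mean():.3f}")
```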

I'm interested in two parts:

  • The math behind such an experiment.
  • Considerations or solutions for a better cross-validated feature transform than LDA.

Update

  • Based on @Andrew M's initial response: I have a different number of observations per class. For example, one class has 120 observations while another has only 40.

Best Answer

Even when I use leave-one-out cross validation (LOOCV) to calculate the LDA projection matrix, it is computed with just one observation held out. My question is: why, even in this case, is the projection matrix ($W$) so over-fitted and sensitive to cross validation? Intuitively I have held out just one sample, yet the projection matrix cannot map the held-out observation correctly.

Well, cross validation is probably doing what it is supposed to do: it measures performance on models trained with almost the same training data. What you observe is that the models are unstable, which is one symptom of overfitting. Considering your data situation, it seems entirely plausible to me that the full model overfits just as badly.

Cross validation does not in itself guard against overfitting (or improve the situation); it just tells you that you are overfitting, and it is then up to you to do something about it.

Keep in mind that the recommended number of training cases for a reasonably stable fit of (unregularized) linear classifiers like LDA is $n > 3p$ to $5p$ in each class. In your case, with $p = 200$ features and 7 classes, that would be, say, $200 \times 5 \times 7 = 7000$ cases, so with 500 cases you are more than an order of magnitude below that recommendation.
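
A quick back-of-the-envelope check of that rule of thumb (a sketch in Python; the numbers are taken from the question):

```python
# The n > 3p..5p-per-class rule of thumb for the stated problem size:
# p = 200 features, 7 classes, 500 cases available.
p, n_classes, n_available = 200, 7, 500
for factor in (3, 5):
    needed = factor * p * n_classes
    print(f"n > {factor}p per class -> {needed} cases "
          f"({needed / n_available:.0f}x what is available)")
```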


Suggestions:

  • As you look at LDA as a projection method, you can also check out PLS (partial least squares). It is related to LDA (Barker & Rayens: Partial least squares for discrimination. J Chemom, 2003, 17, 166–173). In contrast to PCA, PLS takes the dependent variable into account for its projection; and in contrast to LDA (and like PCA), it directly offers regularization. A PLS-DA sketch follows after this list.

  • In small-sample-size situations where n is barely larger than p, many problems can already be solved by linear classification. I'd recommend checking whether the nonlinear second stage (the SVM or KNN) of your classification is really necessary; see the kernel-comparison sketch after this list.

  • Unstable models may be improved by switching to an aggregated (ensemble) model. While bagging is the most famous variety, you can also aggregate the surrogate LDA models from cross validation (e.g. Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations. Anal Bioanal Chem, 2008, 390, 1261–1271. DOI: 10.1007/s00216-007-1818-6). A bagged-LDA sketch follows after this list.

  • Because of the pooling of the covariance matrix, I'd expect your uneven distribution of cases over the different classes to be less of a problem for LDA than for many other classifiers such as SVM. Of course this comes at the cost that a common covariance matrix may not be a good description of your data. However, if your classes are very unequal (or you even have rather ill-defined negative classes such as "something went wrong with the process"), you may want to look into one-class classifiers. They typically need more training cases than discriminative classifiers, but they have the advantage that recognition of classes where you have sufficient cases is not compromised by classes with only few training instances, and the ill-defined classes can be described as "the case belongs to none of the well-defined classes". A one-class sketch follows after this list.
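
If PLS sounds interesting, here is a hedged sketch of what the projection could look like as PLS-DA in scikit-learn. The function name, component count, and one-hot trick are my illustrative choices, not part of the paper cited above:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_da_scores(X_train, y_train, X_test, n_components=6):
    """Supervised projection via PLS-DA: regress a one-hot encoding of the
    class labels on X and use the X-scores as the reduced features."""
    # One-hot encode labels; assumes integer labels 0..k-1.
    Y = np.eye(int(y_train.max()) + 1)[y_train]
    pls = PLSRegression(n_components=n_components).fit(X_train, Y)
    # n_components is the regularization knob; tune it by cross validation.
    return pls.transform(X_train), pls.transform(X_test)
```

To check whether the nonlinear stage earns its keep, one possibility (a sketch reusing the `X`, `y`, and pipeline idea from the first snippet) is to run a linear and an RBF kernel under identical cross validation:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Same honest-CV pipeline as before, once per kernel: if the linear kernel
# matches the RBF kernel, the nonlinear second stage is not buying anything.
for kernel in ("linear", "rbf"):
    pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=6),
                         SVC(kernel=kernel))
    scores = cross_val_score(pipe, X, y, cv=10)
    print(f"{kernel:>6}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For the aggregation suggestion, a minimal sketch using plain bagging of LDA models (scikit-learn's BaggingClassifier; `n_estimators` and `max_samples` are illustrative, and this is bootstrap aggregation rather than the paper's aggregation of cross-validation surrogate models):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Many LDA models fit on resampled subsets of the training data,
# predictions aggregated by voting -- this stabilizes unstable base models.
bagged_lda = BaggingClassifier(
    estimator=LinearDiscriminantAnalysis(),
    n_estimators=50,       # number of resampled LDA models to aggregate
    max_samples=0.8,       # each model sees 80% of the training cases
    random_state=0,
)
print(cross_val_score(bagged_lda, X, y, cv=10).mean())
```

And a hedged sketch of the one-class idea: one model per well-defined class, with "none of the above" when no model accepts a case. OneClassSVM is just one possible choice here, and `nu`/`gamma` would need tuning:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def one_class_predict(X_train, y_train, X_test, nu=0.1):
    """Fit one one-class model per class; label a test case by the most
    accepting model, or -1 if no model accepts it at all."""
    classes = np.unique(y_train)   # assumes non-negative integer labels
    models = [OneClassSVM(nu=nu, gamma="scale").fit(X_train[y_train == c])
              for c in classes]
    # score_samples: higher = more typical of that class
    scores = np.column_stack([m.score_samples(X_test) for m in models])
    accepted = np.column_stack([m.predict(X_test) == 1 for m in models])
    labels = classes[scores.argmax(axis=1)]
    labels[~accepted.any(axis=1)] = -1   # belongs to none of the classes
    return labels
```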
