Dimensionality Reduction – Linear Discriminant Analysis and Non-Normally Distributed Data

dimensionality-reduction, discriminant-analysis, normality-assumption

If I understand correctly, Linear Discriminant Analysis (LDA) assumes normally distributed data, independent features, and identical covariances for every class for its optimality criterion.

Since the mean and variance are estimated from the training data, isn't that already a violation?

I found a quotation in an article (Li, Tao, Shenghuo Zhu, and Mitsunori Ogihara. “Using Discriminant Analysis for Multi-Class Classification: An Experimental Investigation.” Knowledge and Information Systems 10, no. 4 (2006): 453–72):

"linear discriminant analysis frequently achieves good performances in
the tasks of face and object recognition, even though the assumptions
of common covariance matrix among groups and normality are often
violated (Duda, et al., 2001)"

— unfortunately, I couldn't find the corresponding section in Duda et al., Pattern Classification.

Any experiences or thoughts about using LDA (vs. regularized LDA or QDA) for non-normal data in the context of dimensionality reduction?

Best Answer

Here is what Hastie et al. have to say about it (in the context of two-class LDA) in The Elements of Statistical Learning, section 4.3:

Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.

I don't fully understand the derivation via least squares they refer to, but in general [Update: I am going to summarize it briefly at some point] I think this paragraph makes sense: even if the data are very non-Gaussian or the class covariances are very different, the LDA axis will probably still yield some discriminability. However, the cut-point on this axis (separating the two classes) given by LDA can be completely off. Optimizing it separately can substantially improve classification.
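Here is a minimal sketch of what I mean, on made-up data (the dataset and the brute-force search over candidate cut-points are my own illustration, not something from Hastie et al.): fit two-class LDA, project onto its single discriminant axis, and then pick the cut-point on that axis that minimizes training error instead of using the Gaussian-derived threshold.

```python
# Illustration: replace LDA's Gaussian-derived cut-point with one chosen
# to minimize training error on the 1-D LDA projection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
z = lda.transform(X).ravel()          # projection onto the LDA axis

# Candidate cuts: midpoints between consecutive projected values
z_sorted = np.sort(z)
candidates = (z_sorted[:-1] + z_sorted[1:]) / 2

def training_error(cut):
    pred = (z > cut).astype(int)
    # allow either orientation of the axis
    return min(np.mean(pred != y), np.mean((1 - pred) != y))

best_cut = min(candidates, key=training_error)
print("error with LDA's own threshold:", np.mean(lda.predict(X) != y))
print("error with empirical cut-point:", training_error(best_cut))
```

On well-behaved Gaussian-ish data the two thresholds will give nearly identical errors; the empirical cut only pays off when the class distributions are skewed or have very different covariances along the LDA axis.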

Notice that this refers to the classification performance only. If all you are after is dimensionality reduction, then the LDA axis is all you need. So my guess is that for dimensionality reduction LDA will often do a decent job even if the assumptions are violated.
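If dimensionality reduction is the goal, the LDA axes can simply be used as a feature extractor in front of whatever classifier you like. A small sketch (the Iris data and the k-NN classifier downstream are arbitrary choices for illustration):

```python
# LDA used purely for dimensionality reduction (at most n_classes - 1 axes),
# followed by an unrelated classifier on the reduced features.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                     KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipe, X, y, cv=5).mean())
```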

Regarding rLDA and QDA: rLDA has to be used if there are not enough data points to reliably estimate the within-class covariance (and it is vital in that case). QDA, on the other hand, is a non-linear method, so I am not sure how to use it for dimensionality reduction.
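One way to get a regularized LDA in practice, assuming you are working in scikit-learn (the specific dataset sizes below are made up for illustration), is covariance shrinkage:

```python
# Regularized LDA via covariance shrinkage, useful when n_samples is small
# relative to n_features and the plain within-class covariance is unreliable.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           random_state=0)   # few samples, many features

plain = LinearDiscriminantAnalysis(solver='lsqr')                    # no shrinkage
rlda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')   # Ledoit-Wolf shrinkage

print("plain LDA:     ", cross_val_score(plain, X, y, cv=5).mean())
print("shrinkage LDA: ", cross_val_score(rlda, X, y, cv=5).mean())
```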