Solved – Does it make sense to run LDA on several principal components rather than on all variables?

Tags: classification, discriminant analysis, pca, roc

I am interested in building a linear discriminant function to discriminate between 2 groups, out of 60 variables. (I'm planning to select the most discriminative of the variables for a future diagnostic test.) I have calculated the area under the ROC curve for each of these variables individually and none has an AUC greater than 0.73. I have a fairly small sample of 50 healthy and 50 diseased individuals (these are the two groups).
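
For concreteness, a per-variable AUC screening of this kind takes only a few lines in R; this is purely an illustrative sketch, with X as a 100 x 60 data frame of predictors and y as the healthy/diseased group factor (placeholder names, not the actual setup):

```r
library(pROC)

# AUC of each variable on its own
auc_per_variable <- sapply(X, function(x) as.numeric(auc(roc(y, x, quiet = TRUE))))
head(sort(auc_per_variable, decreasing = TRUE))   # the most discriminative variables
```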

I have tried to reduce the number of variables using principal component analysis (PCA). Three components account for 83% of the variation. Unfortunately, however, all 60 variables have similar weightings (loadings) on the 3 components, so I can't pick just a few. I would ordinarily pick the highest-weighted variables and then incorporate them into a linear discriminant function, but 60 is too many, especially given the small sample.

I wondered whether, rather than using the 60 variables, it is possible to use the 3 principal components themselves in a linear discriminant analysis (LDA)?

Best Answer

First of all, do you have an actual indication (external knowledge) that your data consist of a few variates that carry discriminatory information among otherwise noise-only variates? Some types of data can be assumed to follow such a model (e.g. gene microarray data), while others have the discriminatory information "spread out" over many variates (e.g. spectroscopic data). The choice of dimension-reduction technique depends on this.

I think you may want to take a look at section 3.4 (Shrinkage Methods) of The Elements of Statistical Learning.

Principal Component Analysis and Partial Least Squares (a supervised regression analogue to PCA) are best suited to the latter type of data, where the information is spread over many variates.

It is certainly possible to model in the new space spanned by the selected principal components. You just take the scores of those PCs as input for the LDA. This type of model is often referred to as PCA-LDA.
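
It could look like this in R - a minimal sketch, assuming X is a data frame holding the 60 variables and y the healthy/diseased factor (these names, and the choice of 3 PCs, are placeholders, not part of the question):

```r
library(MASS)   # lda()

pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]                # scores of the first 3 PCs
model  <- lda(scores, grouping = y)   # LDA in the reduced space

# New cases must be projected with the rotation learned above, not with a refitted PCA:
new_scores <- predict(pca, newdata = X_new)[, 1:3]   # X_new: hypothetical new cases
predict(model, new_scores)$class
```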

I wrote a bit of a comparison between PCA-LDA and PLS-LDA (doing the LDA in the PLS score space) in my answer to "Should PCA be performed before I do classification?". Briefly, I usually prefer PLS as "preprocessing" for the LDA, as it is very well adapted to situations with large numbers of (correlated) variates and, unlike PCA, it already emphasizes directions that help to discriminate the groups.

PLS-DA (without the L) means "abusing" PLS regression by using dummy levels (e.g. 0 and 1, or -1 and +1) for the classes and then putting a threshold on the regression result. In my experience this is often inferior to PLS-LDA: PLS is a regression technique, and as such it will at some point desperately try to collapse the point clouds around the dummy levels into points (i.e. to project all samples of one class to exactly 1 and all of the other to exactly 0), which leads to overfitting. LDA, as a proper classification technique, helps to avoid this - but it profits from the reduction of variates done by the PLS.
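
A corresponding PLS-LDA sketch (again with the placeholder names X and y, and using the pls package; the dummy coding serves only to fit the PLS projection, the classification itself is still done by the LDA):

```r
library(pls)    # plsr()
library(MASS)   # lda()

y_dummy    <- as.numeric(y) - 1                   # dummy levels, used only to fit the projection
plsfit     <- plsr(y_dummy ~ as.matrix(X), ncomp = 3)   # partial least squares regression
pls_scores <- unclass(scores(plsfit))[, 1:3]      # PLS scores of the training cases
model      <- lda(pls_scores, grouping = y)       # LDA on the scores, not a threshold on the regression
```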

As @January pointed out, you need to be careful with the validation of your model. However, this is easy if you keep two points in mind:

  • Data-driven variable reduction (or selection) such as PCA, PLS, or any picking of variables with the help of measures derived from the data is part of the model. If you do a resampling validation (iterated $k$-fold cross-validation, out-of-bootstrap) - which you should do given your restricted sample size - you need to redo this variable reduction for each of the surrogate models (see the sketch after this list).
  • The same applies to data-driven (hyper)parameter selection such as determining the number of PCs or latent variables for the PLS: redo this for each of the surrogate models (e.g. in an inner resampling validation loop) or fix the hyperparameters in advance. The latter is possible with a bit of experience with the particular type of data, and particularly for the PCA-LDA and PLS-LDA models, as they are not too sensitive to the exact number of variates. A further advantage of fixing them in advance is that data-driven optimization is rather difficult for classification models: you should use a so-called proper scoring rule for it, which in turn needs rather large numbers of test cases.
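
To illustrate the first point, here is a sketch of a single run of $k$-fold cross-validation that refits the PCA inside every fold (same placeholder X and y as above; in practice you would iterate this, and choose the number of PCs in an inner loop or fix it beforehand):

```r
library(MASS)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(X)))
pred  <- factor(rep(NA, nrow(X)), levels = levels(y))

for (i in 1:k) {
  train <- folds != i
  pca_i <- prcomp(X[train, ], center = TRUE, scale. = TRUE)  # reduction refitted on the training part only
  lda_i <- lda(pca_i$x[, 1:3], grouping = y[train])          # surrogate model of this fold
  test_scores  <- predict(pca_i, newdata = X[!train, ])[, 1:3]
  pred[!train] <- as.character(predict(lda_i, test_scores)$class)
}

table(truth = y, predicted = pred)
```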

(I cannot recommend a solution in Stata, but I can point you to an R package in which I implemented these combined models.)


Update to answer @doctorate's comment:

Yes, in principle you can treat the PCA or PLS projection as dimensionality-reduction pre-processing and follow it with any other kind of classifier. IMHO, one should spend a few thoughts on whether this approach is appropriate for the data at hand.
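
For example (a hypothetical sketch with the same placeholder X and y), the PC scores could just as well be fed into a logistic regression instead of the LDA:

```r
pca <- prcomp(X, center = TRUE, scale. = TRUE)
d   <- data.frame(y = y, pca$x[, 1:3])   # score columns are named PC1, PC2, PC3
fit <- glm(y ~ PC1 + PC2 + PC3, data = d, family = binomial)
```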