Solved – PCA when the dimensionality is greater than the number of samples

dimensionality-reduction, pca, svd

I've come across a scenario where I have 10 signals per person for 10 people (so 100 samples), each containing 14000 data points (dimensions), that I need to pass to a classifier. I would like to reduce the dimensionality of this data, and PCA seems to be the way to do so. However, I've only been able to find examples of PCA where the number of samples is greater than the number of dimensions. I'm using a PCA application that finds the PCs using SVD. When I pass it my 100×14000 dataset, only 101 PCs are returned, so the vast majority of the original dimensions are obviously discarded. The program indicates that the first 6 PCs contain 90% of the variance.

Is it reasonable to assume that these 101 PCs contain essentially all the variance and that the remaining dimensions are negligible?
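For reference, here is a minimal sketch of what I'm doing (in scikit-learn rather than the application I'm actually using, and with random numbers standing in for my real signals); the output shape shows why no more than about 100 PCs can ever come back:

```python
# Minimal sketch, not my actual tool: scikit-learn's SVD-based PCA on
# random stand-in data with the same shape as my 100 x 14000 dataset.
# The centred data matrix has rank at most min(n_samples, n_features),
# so the number of returned components is capped at ~100.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 14000))   # stand-in for the real signals

pca = PCA()                          # keeps min(n_samples, n_features) PCs
scores = pca.fit_transform(X)

print(scores.shape)                               # (100, 100)
print(pca.explained_variance_ratio_[:6].sum())    # variance in the first 6 PCs
```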

One of the papers I've read claims that, using a dataset similar to (though of slightly lower quality than) my own, they were able to reduce 4500 dimensions down to 80 while retaining 96% of the original information. The paper hand-waves over the details of the PCA technique used; only 3100 samples were available, and I have reason to believe that fewer samples than that were used to actually perform the PCA (to remove bias from the classification phase).

Am I missing something, or is this really how PCA is used with high-dimensionality, low-sample-size datasets? Any feedback would be greatly appreciated.

Best Answer

I'd look at the problem from a slightly different angle: how complex a model can you afford with only 10 subjects / 100 samples?

My usual answer to that question: far fewer than 100 PCs. Note that I work on a different type of data (vibrational spectra), so things may vary a bit. In my field, a common setup would be using 10, 25, or 50 PCs calculated from O(1000) spectra of O(10) subjects.

Here's what I'd do:

  • Look at the variance covered by those 100 PCs. I usually find that only a few components really contribute to the variance in our data.

  • I very much prefer PLS over PCA as a pre-treatment for classification: it does a much better job of sorting out directions that have high variance but do not help the classification (in my case that could be focus variations, differing sample thickness, ...). In my experience, I often get similar classifiers with 10 PLS latent variables or 25 to 50 PCs (a sketch of this pre-treatment follows the list).

  • Validation samples need to be projected with the PCA rotation calculated from the training set only; otherwise the validation can (and in extreme cases such as yours most probably will) carry a large overoptimistic bias.
    In other words, if you do out-of-bootstrap or cross validation, the PCA or PLS preprocessing needs to be recalculated for each train/test combination separately (see the second sketch below).
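To make the PLS point concrete, here is a minimal sketch of a PLS-DA-style pre-treatment (assuming scikit-learn and random stand-in data; the choice of classifier and number of latent variables would of course come from validation on your own data). The class membership is dummy-coded so that the latent variables are steered toward class-relevant variation rather than raw variance:

```python
# Minimal sketch of PLS as pre-treatment for classification (PLS-DA style)
# on random stand-in data. Fit shown on all samples for brevity only --
# in a real validation it must be refit inside each train/test split.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 14000))      # stand-in signals
y = np.repeat(np.arange(10), 10)       # 10 subjects, 10 signals each

Y = np.eye(10)[y]                      # dummy-code the class membership
pls = PLSRegression(n_components=10)   # ~10 latent variables often suffice
T = pls.fit(X, Y).transform(X)         # scores, shape (100, 10)

clf = LinearDiscriminantAnalysis().fit(T, y)
```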
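And for the last point, a minimal sketch of keeping the projection inside the resampling loop (again assuming scikit-learn and stand-in data): wrapping the PCA and the classifier in a Pipeline makes cross_val_score refit the PCA on each training fold only, so the held-out samples never leak into the rotation:

```python
# Minimal sketch: the PCA rotation is recalculated on each training fold
# and never sees the held-out fold, because it lives inside the Pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 14000))      # stand-in signals
y = np.repeat(np.arange(10), 10)       # person identity as the class label

pipe = make_pipeline(PCA(n_components=25), LinearDiscriminantAnalysis())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv).mean())   # unbiased estimate
```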