Dimensionality Reduction – Is it Always Useful for Classification?

Tags: machine-learning, pca, svd

Is singular value decomposition almost always useful in practice for enhancing the predictive power of a trained classification model?

E.g., a dataset for classification has 20,000 features. Run SVD to project them onto the top principal components, reducing them to 300 features, and train a classification model on the reduced data. To predict the class of a test instance, convert it to a 300-dimensional principal-component feature vector and use the trained model to predict its class.
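
For concreteness, a minimal sketch of that workflow, assuming scikit-learn; TruncatedSVD, LogisticRegression, and the synthetic matrix are only placeholders for the real 20,000-feature data and whatever classifier is actually used:

```python
# Sketch of the SVD-then-classify workflow: reduce 20,000 features to 300
# components, train a classifier, and project test instances the same way.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20_000))           # stand-in for the real data
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # arbitrary synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(TruncatedSVD(n_components=300, random_state=0),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Test instances are projected onto the same 300 components before prediction.
print(model.score(X_test, y_test))
```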

Are there some notable real datasets of numerous features (variables) in which dimension reduction by SVD would hurt the predictive power of trained classification models?

Best Answer

I think there are two ways to look at the question of whether SVD/PCA helps in general.

Is it better to use PCA-reduced data instead of the raw data?

Often yes, but there are situations where PCA is not needed.

In addition, I'd consider how well the bilinear model behind PCA fits the data-generating process. I work with linear spectroscopy, which is governed by physical laws implying that my observed spectra $\mathbf X$ are linear combinations of the spectra $\mathbf S$ of the chemical species present, weighted by their respective concentrations $\mathbf C$: $\mathbf X = \mathbf C \mathbf S$. This fits very well with the PCA model of scores $\mathbf T$ and loadings $\mathbf P$: $\mathbf X = \mathbf T \mathbf P$.
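
A quick toy check of that structure (my own NumPy sketch with made-up numbers, not actual spectra): when the rows of $\mathbf X$ really are mixtures of a few pure-component spectra, essentially all of the signal sits in as many singular vectors as there are species.

```python
# Data generated as X = C S (a few pure spectra mixed by concentrations)
# is low-rank, so truncating an SVD/PCA loses almost nothing.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_species, n_channels = 200, 3, 500

S = rng.normal(size=(n_species, n_channels))   # pure-component "spectra"
C = rng.uniform(size=(n_samples, n_species))   # concentrations
X = C @ S + 1e-6 * rng.normal(size=(n_samples, n_channels))  # tiny noise

sv = np.linalg.svd(X, compute_uv=False)
print(sv[:6].round(4))   # only the first 3 singular values are appreciably non-zero
```
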
I don't know of any example where PCA has hurt a model (except for gross errors in setting up a combined PCA-plus-whatever model).

Even if the underlying relationship in your data does not fit the bilinear approach of PCA all that well, PCA is in the first place only a rotation of your data, which usually does not hurt. Discarding the higher PCs is what produces the dimension reduction, but by construction of the PCA those components carry only small amounts of variance. So again, chances are that even if PCA is not all that suitable, it won't hurt much either.
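
As a quick illustration of the rotation point (my own sketch, assuming scikit-learn and SciPy; the random data is just a stand-in): keeping all components leaves pairwise distances untouched, and the trailing components you would drop are exactly the low-variance ones.

```python
# PCA with all components kept is only a centering plus an orthogonal rotation,
# so pairwise distances (and anything built on them) are unchanged.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

Z = PCA(n_components=20).fit_transform(X)   # full rotation, no truncation
print(np.allclose(pdist(X), pdist(Z)))      # True: geometry is preserved

# The components that truncation would discard are, by construction,
# the ones with the smallest share of the variance.
print(PCA().fit(X).explained_variance_ratio_.round(3))
```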

This is also part of the bias-variance trade-off in the context of PCA as a regularization technique (see @usεr11852's answer).
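
One way to make that regularization view concrete is to treat the number of retained components as a tuning parameter and pick it by cross-validation; a rough sketch, assuming scikit-learn and its bundled digits data:

```python
# Treating the number of retained PCs as a regularization knob:
# too few components means high bias, too many means little regularization.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)          # 64 pixel features per digit

pipe = Pipeline([("pca", PCA()),
                 ("clf", LogisticRegression(max_iter=2000))])
grid = {"pca__n_components": [5, 10, 20, 40, 64]}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```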

Is it better to use PCA instead of some other dimension reduction technique?

The answer to this will be application-specific. But if your application suggests some other way of generating features, those features may be far more powerful than a few PCs, so this is worth considering.

Again, my data and applications happen to be of a nature where PCA is a rather natural fit, so I use it and I cannot contribute a counter-example.

But: having a PCA hammer does not imply that all problems are nails... Looking for counterexamples, I'd maybe start with image analysis, where the objects in question can appear anywhere in the picture. The people I know who work on such tasks usually develop specialized features.

The only task I routinely have that comes close to this is detecting cosmic-ray spikes in my camera signals (sharp peaks, anywhere in the spectrum, caused by cosmic rays hitting the CCD). I, too, use specialized filters to detect them, although they are easy to spot after a PCA as well. However, we would describe that rather as PCA not being robust against spikes, and we treat the spikes as a disturbance to the PCA rather than something it helps with.