PCA – Is PCA Always Recommended?

classification, dimensionality reduction, pca

I was wondering whether PCA can always be applied for dimensionality reduction before a classification or regression problem. My intuition tells me that the answer is no.

If we perform PCA, we calculate linear combinations of the features to build principal components that explain most of the variance of the dataset. However, we might be leaving out features that do not explain much of the variance of the dataset but do explain what distinguishes one class from another.

Am I correct? Should we always reduce dimensions with PCA when needed, or are there considerations that need to be taken into account (such as the one above)?

Best Answer

Blindly using PCA is a recipe for disaster. (As an aside, automatically applying any method is not a good idea, because what works in one context is not guaranteed to work in another. We can formalize this intuitive idea with the No Free Lunch theorem.)

It's easy enough to construct an example where the eigenvectors corresponding to the smallest eigenvalues are the most informative. If you discard these components, you're discarding the most helpful information for your classification or regression problem, and your model would have been better had you retained them.

More concretely, suppose $A$ is our design matrix with each column mean-centered. Then we can use the SVD to compute the PCA of $A$ (see: Relationship between SVD and PCA. How to use SVD to perform PCA?).
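To make that concrete, here is a minimal sketch (assuming NumPy and made-up data; the names `A`, `U`, `s`, `V` are only illustrative) of obtaining the principal component scores of a mean-centered design matrix from its SVD:

```python
import numpy as np

# Made-up design matrix for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
A = A - A.mean(axis=0)          # mean-center each column

# SVD: A = U diag(s) V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                        # columns of V are the principal directions

scores = A @ V                  # principal component scores, i.e. AV
assert np.allclose(scores, U * s)   # AV equals US up to floating-point error
```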

For an example in the case of a linear model, write the SVD as $A = USV^\top$; this gives us the factorization $$ AV = US, $$

and we wish to predict some outcome $y$ as a linear combination of the PCs: $y = AV\beta + \epsilon$, where $\epsilon$ is some noise. Further, let's assume that this linear model is the correct model.

In general, the vector $\beta$ can be anything, just as in an ordinary OLS regression setting; but in any particular problem, it's possible that the only nonzero elements of $\beta$ are the ones corresponding to the smallest positive singular values. Whenever this is the case, using PCA to reduce the dimension of $AV$ by discarding the components with the smallest singular values will also discard the only relevant predictors of $y$. In other words, even though we started out with the correct model, the truncated model is not correct because it omits the key variables.
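As a sanity check, here is a hypothetical simulation of that failure mode (again assuming NumPy; the data-generating process is invented purely for illustration). The outcome depends only on the lowest-variance direction, so regressing on the top PC alone finds essentially nothing, while regressing on all PCs recovers the signal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two features: x1 has large variance but is pure noise with respect to y,
# while x2 has tiny variance but fully determines y.
x1 = rng.normal(scale=10.0, size=n)
x2 = rng.normal(scale=0.1, size=n)
A = np.column_stack([x1, x2])
A = A - A.mean(axis=0)

y = 5.0 * x2 + rng.normal(scale=0.01, size=n)   # signal lives in the low-variance direction

U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = U * s                                  # = AV, the principal component scores

def r2(pred, y):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Regression on the first (largest-variance) PC only
b1, *_ = np.linalg.lstsq(scores[:, :1], y - y.mean(), rcond=None)
print("R^2 using PC1 only:", r2(scores[:, :1] @ b1 + y.mean(), y))   # close to 0

# Regression on all PCs (equivalent to OLS on the original columns of A)
b_all, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
print("R^2 using all PCs:", r2(scores @ b_all + y.mean(), y))        # close to 1
```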

PCA has this weakness in a supervised learning scenario because it is not "$y$-aware." Of course, in the cases where PCA is a helpful step, $\beta$ will have nonzero entries corresponding to the larger singular values.

I think this example is instructive because it shows that even in the special case that the model is linear, truncating $AV$ risks discarding information.

There are other common objections as well.

Some more examples can be found in this closely-related thread (thanks, @gung!): Examples of PCA where PCs with low variance are "useful"