Regression – How Top Principal Components Retain Predictive Power for Dependent Variables: PCA and Dimensionality Reduction

classification · dimensionality reduction · pca · regression · regularization

Suppose I am running a regression $Y \sim X$. Why does the model retain its predictive power on $Y$ if I keep only the top $k$ principal components of $X$?

I understand that, from a dimensionality-reduction/feature-selection point of view, if $v_1, v_2, \dots, v_k$ are the eigenvectors of the covariance matrix of $X$ with the top $k$ eigenvalues, then $Xv_1, Xv_2, \dots, Xv_k$ are the top $k$ principal components, i.e. the linear combinations with maximum variance. We can thereby reduce the number of features to $k$ while, as I understand it, retaining most of the predictive power.
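For concreteness, here is a minimal sketch (my own, not part of the original question; variable names are illustrative) of how the top-$k$ principal components are obtained from the eigendecomposition of the covariance matrix:

```python
# Minimal sketch: top-k principal components from the covariance eigendecomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # n = 200 observations, p = 10 features
Xc = X - X.mean(axis=0)               # centre the columns

cov = np.cov(Xc, rowvar=False)        # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; keep the top k
k = 3
order = np.argsort(eigvals)[::-1][:k]
V_k = eigvecs[:, order]               # columns are v_1, ..., v_k

PCs = Xc @ V_k                        # the top-k principal components X v_1, ..., X v_k
print(PCs.shape)                      # (200, 3)
```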

But why do the top $k$ components retain predictive power on $Y$?

In a general OLS regression $Y \sim Z$, there is no reason to expect that the feature $Z_i$ with the largest variance also has the most predictive power on $Y$.

Update after seeing comments: I guess I have seen tons of examples of using PCA for dimensionality reduction. I have been assuming that means the dimensions we are left with have the most predictive power. Otherwise what's the point of dimensionality reduction?

Best Answer

Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.

Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
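As a hedged illustration of such an artificial example (my own construction, not taken from any linked thread), the following sketch builds an $X$ whose smallest-variance direction is the only one related to $y$:

```python
# Artificial example: y depends only on the direction of *smallest* variance in X.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Three uncorrelated directions with very different variances
z1 = rng.normal(scale=10.0, size=n)   # high variance -> top PC
z2 = rng.normal(scale=3.0,  size=n)
z3 = rng.normal(scale=0.5,  size=n)   # low variance -> last PC
X = np.column_stack([z1, z2, z3])

# y is driven only by the low-variance direction
y = 2.0 * z3 + rng.normal(scale=0.1, size=n)

# Principal components via SVD of the centred data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
PCs = Xc @ Vt.T                       # columns ordered by decreasing variance

for j in range(3):
    r = np.corrcoef(PCs[:, j], y)[0, 1]
    print(f"corr(PC{j+1}, y) = {r:+.3f}")
# Only the last (smallest-variance) PC correlates with y, so keeping just the
# top PCs would throw away essentially all of the predictive power.
```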

This topic was discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real life as well as artificial examples:

And the same topic, but in the context of classification:


However, in practice, top PCs do often have more predictive power than the low-variance ones, and moreover, using only the top PCs can yield better predictive power than using all PCs.

In situations with a lot of predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p>n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give better results than unregularized regression. Moreover, it is closely related to ridge regression, which is a standard form of shrinkage regularization. While using ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
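Here is a rough simulated comparison (my own sketch, assuming scikit-learn is available; it is not code from this answer) of OLS, PCR with a handful of components, and ridge regression when $p$ is close to $n$; the exact numbers will vary with the seed:

```python
# Sketch: OLS vs. principal component regression (PCR) vs. ridge when p is close to n.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 60, 50                                   # p close to n -> plain OLS overfits
latent = rng.normal(size=(n, 3))                # a few strong underlying directions
X = latent @ rng.normal(size=(3, p)) + 0.3 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.5 * rng.normal(size=n)     # y driven by a high-variance direction

models = {
    "OLS":   LinearRegression(),
    "PCR-5": make_pipeline(PCA(n_components=5), LinearRegression()),
    "Ridge": Ridge(alpha=10.0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:6s} mean CV R^2 = {score:.3f}")
```

Because $y$ here is driven by a direction that shows up in the large PCs, keeping only a few components acts as a regularizer and typically beats unregularized OLS in cross-validation; this is exactly the situation the answer describes.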

In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.

See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).

Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:

[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
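To make the quoted point concrete, the following sketch (mine, using the standard SVD identity for ridge regression rather than anything from the book) prints the ridge shrinkage factors $d_j^2/(d_j^2+\lambda)$ for directions with very different singular values $d_j$:

```python
# Ridge shrinks the fitted values along direction u_j by d_j^2 / (d_j^2 + lambda),
# so low-variance (small d_j) directions are shrunk the most.
import numpy as np

rng = np.random.default_rng(3)
# Columns with very different scales -> very different singular values
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = 1.0
shrinkage = d**2 / (d**2 + lam)

for dj, sj in zip(d, shrinkage):
    print(f"singular value d_j = {dj:7.2f}   ridge shrinkage factor = {sj:.4f}")
# The direction with the smallest singular value (smallest variance in X)
# has its contribution to the fit shrunk the hardest.
```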

See my answers in the following threads for details:


Bottom line

For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.