Why does Daniel Wilks (2011) say that principal component regression "will be biased"?


In Statistical Methods in the Atmospheric Sciences, Daniel Wilks notes that multiple linear regression can lead to problems if there are very strong intercorrelations among the predictors (3rd edition, pages 559-560):

A pathology that can occur in multiple linear regression is that a set of predictor variables having strong mutual correlations can result in the calculation of an unstable regression relationship.

(…)

He then introduces principal component regression:

An approach to remedying this problem is to first transform the predictors to their principal components, the correlations among which are zero.

So far so good. But next, he makes some statements that he does not explain (or at least not in sufficient detail for me to understand):

If all the principal components are retained in a principal component regression, then nothing is gained over the conventional least-squares fit to the full predictor set.

(…) and:

It is possible to reexpress the principal-component regression in terms of the original predictors, but the result will in general involve all the original predictor variables even if only one or a few principal component predictors have been used. This reconstituted regression will be biased, although often the variance is much smaller, resulting in a smaller MSE overall.

I don't understand these two points.

Of course, if all the principal components are retained, we are using the same information as with the predictors in their original space. However, the problem of mutual correlations is removed by working in principal component space. We may still have overfitting, but is that the only problem? Why is nothing gained?

Secondly, even if we do truncate the principal components (perhaps for noise reduction and/or to prevent overfitting), why and how does this lead to a biased reconstituted regression? Biased in what way?


Book source: Daniel S. Wilks, Statistical Methods in the Atmospheric Sciences, Third edition, 2011. International Geophysics Series Volume 100, Academic Press.

Best Answer

What happens when all PCs are used?

If all PCs are used, then the resulting regression coefficients will be identical to the ones obtained with OLS regression, so this procedure should not really be called "principal component regression". It is standard regression, merely performed in a roundabout way.

You are asking how it is possible that nothing is gained, given that after PCA the predictors become orthogonal. The devil is in the back-transformation of the regression coefficients from PCA space to the original space. The key fact is that the covariance matrix of the estimated regression coefficients is $\sigma^2 (X^\top X)^{-1}$, i.e. it depends inversely on the covariance matrix of the predictors. The PCA-transformed predictors, let's call them $Z$, have a diagonal covariance matrix (because they are uncorrelated). So the estimated regression coefficients for $Z$ are also uncorrelated; the ones corresponding to the high-variance PCs have low variance (i.e. are estimated reliably), and the ones corresponding to the low-variance PCs have high variance (i.e. are estimated unreliably). When these coefficients are back-transformed to the original predictors $X$, each original coefficient receives a portion of the unreliable estimates, and so all coefficients can become unreliable.

So nothing is gained.
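To make this concrete, here is a minimal numpy sketch on simulated data (everything below is invented for illustration, not from the book): regressing on all the PCs and back-transforming the coefficients reproduces the OLS solution exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated correlated predictors and response (invented for illustration).
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # induce intercorrelations
X = X - X.mean(axis=0)                                  # center the predictors
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()

# OLS on the original predictors.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# PCR keeping ALL principal components: regress on Z = XV, then map back.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                  # scores on all p PCs
gamma = np.linalg.lstsq(Z, y, rcond=None)[0]  # coefficients in PC space
beta_pcr = Vt.T @ gamma                       # back-transform to X space

print(np.allclose(beta_ols, beta_pcr))        # True: nothing is gained
```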

What happens when only a few PCs are used?

When not all the PCs are retained in PCR, the resulting solution $\hat \beta_\mathrm{PCR}$ will generally not be equal to the standard ordinary least squares solution $\hat \beta_\mathrm{OLS}$. It is a standard result that the OLS solution is unbiased: see the Gauss-Markov theorem. "Unbiased" means that $\hat \beta$ is correct on average, even though it can be very noisy. Since the PCR solution differs from it, it will be biased, meaning that it will be incorrect on average. However, it often has substantially lower variance, leading to more accurate predictions overall.

This is an example of the bias-variance trade-off. See Why does shrinkage work? for some further general discussion.
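A small simulation makes the trade-off visible. The setup below is invented for illustration (a fixed design with strongly intercorrelated columns and an arbitrary true coefficient vector): it repeatedly refits OLS and a two-component PCR, then compares the squared bias and the variance of the back-transformed coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design with strongly intercorrelated predictors (invented for illustration).
n, p, k = 50, 5, 2                        # k = number of retained PCs
X = rng.normal(size=(n, 1)) + 0.1 * rng.normal(size=(n, p))  # common factor
X = X - X.mean(axis=0)
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                             # p x k matrix of top-k PC loadings

ols_est, pcr_est = [], []
for _ in range(2000):
    y = X @ beta_true + rng.normal(size=n)
    ols_est.append(np.linalg.lstsq(X, y, rcond=None)[0])
    gamma = np.linalg.lstsq(X @ Vk, y, rcond=None)[0]  # regression on top-k PCs
    pcr_est.append(Vk @ gamma)                         # back to original space

for name, est in (("OLS", np.array(ols_est)), ("PCR", np.array(pcr_est))):
    bias2 = np.sum((est.mean(axis=0) - beta_true) ** 2)
    var = est.var(axis=0).sum()
    print(f"{name}: bias^2={bias2:.3f}  variance={var:.3f}  MSE={bias2 + var:.3f}")
```

With predictors this strongly correlated, OLS comes out (nearly) unbiased but very noisy, while the truncated PCR trades some bias for a much smaller variance, which in a configuration like this typically yields a smaller MSE overall, exactly as Wilks describes.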

In the comments, @whuber pointed out that the PCR solution does not have to differ from the OLS one, and hence does not have to be biased. Indeed, if the dependent variable $y$ is uncorrelated (in the population, not in the sample) with all the low-variance PCs that are not included in the PCR model, then dropping these PCs will not affect the unbiasedness. This, however, is unlikely to be the case in practice: PCA is conducted without taking $y$ into account, so it stands to reason that $y$ will tend to be somewhat correlated with all the PCs.
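For completeness, here is a minimal sketch of that special case for a fixed design (the construction is mine, not from the thread): the true coefficient vector is built to lie in the span of the retained loadings, so the dropped PCs carry no signal and the truncated PCR estimate comes out unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of correlated fixed design as before (invented for illustration).
n, p, k = 50, 5, 2
X = rng.normal(size=(n, 1)) + 0.1 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                     # p x k matrix of top-k PC loadings

# Special case: the true coefficients lie in the span of the retained
# loadings, i.e. y depends only on the kept PCs (assumed for illustration).
beta_true = Vk @ np.array([2.0, -1.0])

estimates = []
for _ in range(2000):
    y = X @ beta_true + rng.normal(size=n)
    gamma = np.linalg.lstsq(X @ Vk, y, rcond=None)[0]  # regression on top-k PCs
    estimates.append(Vk @ gamma)                       # back to original space

# Mean estimate matches beta_true up to Monte Carlo error: no bias here.
print(np.abs(np.mean(estimates, axis=0) - beta_true).max())
```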

Why is using high-variance PCs a good idea at all?

This was not part of the question, but you might be interested in the following thread for further reading: How can top principal components retain the predictive power on a dependent variable (or even lead to better predictions)?