Solved – PCA; variance, interpretability, and scaling

pca, standardization, variance

I've been going through threads about PCA and the predictive power of various axes, but since I cannot comment, I am opening a new question.

There is a lot of discussion about whether PCA components with small variance can be highly predictive and thus should not be 'dropped'. In general the answer is 'yes, they can be', since it all depends on the data, but the examples that are typically given are a bit confusing to me.

A typical example goes like this: we have a feature x_1 (e.g. height) that fluctuates a lot and a feature x_2 (e.g. temperature) that varies less but somehow affects our label the most. Thus, low-variance axes can be important.

I have a few questions regarding the above:

1) The examples refer to features in the non-transformed system (prior to applying PCA), and since these features need to be (or at least can be) standardised, the whole idea of their absolute variance loses its point, no?

2) After the PCA transformation the principal axes can have very different variances, but these axes are linear combinations of the original axes, so when we 'drop' one or more of them to reduce the dimension, we still preserve partial information from all the original axes (i.e. we do not drop any original axis). A small sketch below illustrates both points.
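
For concreteness, here is a tiny R sketch of what I mean in 1) and 2) (made-up numbers, purely illustrative):

set.seed(1)
x1 <- rnorm(100, sd = 10)                # "height": large raw variance
x2 <- 0.1 * x1 + rnorm(100, sd = 1)      # "temperature": small raw variance
X  <- scale(cbind(x1, x2))               # after standardising, both variances equal 1
apply(X, 2, var)                         # 1 and 1: the original variances no longer differ
prcomp(X)$sdev^2                         # yet the variances of the principal axes still differ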

Just to summarise my problem: it would seem that all the examples refer to the degree of variance in the original data and not to the variance after the transformation (the maximisation of which is the objective of PCA), so I am wondering what would be a correct example of a case where low-variance PCA axes can be important?

(The issue being also that after PCA the data loses its interpretability.)

Best Answer

This question appears to focus on two things:

  1. How can principal components regression fail badly?

  2. Could the failure be due to not standardizing the variables?

Let me elaborate on the meanings of these terms. The setting is where you have $(x_1, x_2, \ldots, x_k, y)$ data. You view the values of the $y$ as $n$ independent observations of random variables $Y(x_1, \ldots, x_k)$ that depend on the $x$ values. You wish to model the expectation of $Y$ (equivalently, you wish to "predict" $Y$) by means of a linear function of the $x_i,$

$$E[Y] = h(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)$$

for some stipulated function $h$ and unknown parameters $\beta_j.$ (See https://stats.stackexchange.com/a/148713/919 for more about this approach, which when $h$ is the identity underlies the usual ordinary least squares model.)
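
When $h$ is the identity, this is just an ordinary least-squares fit; a minimal R sketch, with hypothetical variable names, would be:

# `dat` is a hypothetical data frame with columns y, x1, x2, x3
fit <- lm(y ~ x1 + x2 + x3, data = dat)   # E[Y] = beta0 + beta1*x1 + beta2*x2 + beta3*x3
summary(fit)                              # estimated beta_j with standard errors and p-values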

In many practical cases, especially observational studies, the observed variables (thought of as $k$ $n$-vectors) may be close to linearly dependent, a condition referred to as "multicollinearity." This creates technical problems for fitting the model and conceptual problems in interpreting the estimated coefficients.
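
A quick numeric symptom of near-linear dependence (a sketch, assuming the regressors sit in a numeric matrix X such as the one built in the code at the end) is a near-zero eigenvalue, or equivalently a large condition number, of their correlation matrix:

ev <- eigen(cor(X), symmetric = TRUE)$values  # eigenvalues of the correlation matrix
ev                                            # a value near zero signals near-linear dependence
sqrt(max(ev) / min(ev))                       # condition number: large values indicate multicollinearity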

A remedy that is frequently offered is to conduct a preliminary Principal Components Analysis (PCA) of the regressor variables. PCA replaces those $k$ vectors by $k$ other vectors ("principal components") $z_1, \ldots, z_k$ with the following properties:

  1. For any $i\ne j,$ $z_i$ and $z_j$ are orthogonal.

  2. For $i \lt j,$ the Euclidean length of $z_i$ equals or exceeds that of $z_j.$

  3. For every $i,$ $x_i$ can be reconstructed as a linear combination of a constant and the $z_j.$

The first two properties greatly simplify the model fitting, while the last guarantees that the model has not really changed: the $z_i$ are merely a way of re-expressing the $x_i.$
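
These properties are easy to verify numerically with R's prcomp (a minimal sketch, assuming X is a numeric matrix of regressors such as the one constructed in the code at the end of this answer):

pca <- prcomp(X)                              # center the regressors, then rotate
Z   <- pca$x                                  # the principal components z_1, ..., z_k, as columns
zapsmall(crossprod(Z))                        # property (1): a diagonal matrix, so the z_j are orthogonal
pca$sdev                                      # property (2): non-increasing lengths (standard deviations)
X.rec <- sweep(Z %*% t(pca$rotation), 2, pca$center, "+")  # property (3): a constant plus a linear combination of the z_j
max(abs(X - X.rec))                           # essentially zero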

The Euclidean length in $(2)$ is usually interpreted as something proportional to a variance, because the $z_j$ are almost always constructed so that their components sum to zero. In this sense, $z_1$ is a "direction of greatest variability" of the $x_i,$ $z_2$ is a direction of second greatest variability, and so on. Thus, in reconstructing the $x_i$ from the $z_j,$ the first few principal components contribute the most and may be referred to as the "most important" components. Usually one can safely omit the last few principal components and still very closely reconstruct the original $x_i.$ This further simplifies the modeling by reducing the number of independent variables.
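
To see how closely the first few components alone reconstruct the regressors, continue the previous sketch (the choice of m is illustrative):

m <- 2                                        # keep only the first m components
X.approx <- sweep(Z[, 1:m] %*% t(pca$rotation[, 1:m]), 2, pca$center, "+")  # rank-m reconstruction
max(abs(X - X.approx))                        # small whenever the dropped components are tiny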

There is a fundamental problem with this "principal component regression" approach: it makes no consideration of the observed responses $y.$ Consequently, it may utterly fail.

I offer an example and then, at the end, show how to construct many others like it with any values of $n$ and $k.$

First, here is a scatterplot matrix showing the relationships among $k=3$ variables and a response $y$ for $n=200$ observations. The red lines are least-squares fits.

Figure 1

The covariance matrix of the $x_i$ shows they already have comparable scales, so it will make little difference whether PCA uses the covariance or the correlation:

     X.1  X.2  X.3
X.1 1.25 1.22 0.10
X.2 1.22 1.20 0.10
X.3 0.10 0.10 0.97

In the following analysis the correlation matrix was used: that is, the $x_i$ were standardized before applying PCA. Using the covariance matrix, the results would be qualitatively the same (I checked).
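
In prcomp the choice between the two is just the scale. argument (a brief sketch):

summary(prcomp(X, scale. = TRUE))    # PCA of the correlation matrix (standardized x_i)
summary(prcomp(X, scale. = FALSE))   # PCA of the covariance matrix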

The strong positive correlation between $x_1$ and $x_2$ creates multicollinearity. This is evident in the "scree plot" for the PCA, which shows the relative squared Euclidean lengths of the $z_j:$

Figure 2 (scree plot)

The tiny third component can be ignored, indicating these three original variables can be accurately reconstructed from just two principal components. Confirming this, a standard summary reports the Euclidean lengths of the components:

Importance of components:
                         PC1    PC2     PC3
Standard deviation     1.419 0.9910 0.07023
Proportion of Variance 0.671 0.3273 0.00164
Cumulative Proportion  0.671 0.9984 1.00000

Seeing this output, anyone using PCA regression would surely drop the last component $z_3$ (or PC3), figuring that "the first two components explain 99.84% (essentially 100%) of the variation among the regressors."
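
That cumulative proportion can be computed directly from the component standard deviations (a one-line sketch using the PCA object from the code at the end):

cumsum(PCA$sdev^2) / sum(PCA$sdev^2)   # cumulative proportion of variance explained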

Let's see how $y$ relates to the principal components $z_j.$ Here is the top row of the scatterplot matrix of $(y, z_1, z_2, z_3)$ (the other rows will show perfect zero correlations among the $z_j$ due to property $(1)$ of the PCA):

Figure 3

Therein lies the problem: although $y$ is uncorrelated with the principal components that would be kept, it is strongly correlated with the one that would be dropped.
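
A direct numeric check (a sketch using the y and PCA objects from the code at the end of this answer) is simply to correlate the response with each component:

round(cor(y, PCA$x), 3)   # near zero for PC1 and PC2, but large in magnitude for PC3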

Let's examine the two ordinary least squares regression models under consideration. First, the "full" or "base" model that includes all the $x_i:$

             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -0.024359   0.006636   -3.671 0.000312 ***
X.1         -9.053688   0.059840 -151.299  < 2e-16 ***
X.2          9.237486   0.061058  151.289  < 2e-16 ***
X.3         -0.055985   0.006776   -8.263  2.1e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09354 on 196 degrees of freedom
Multiple R-squared:  0.9915,    Adjusted R-squared:  0.9914 
F-statistic:  7651 on 3 and 196 DF,  p-value: < 2.2e-16

This demonstrates that $y$ can be predicted with high precision from these three variables (and an intercept term), with relatively little error. (It is possible to construct examples in which this full regression has an arbitrarily high p-value: in other words, it looks "insignificant." Because such examples typically have higher values of $k,$ I won't encumber this post with the lengthy output.)

Here are results for the PCA regression in which every principal component but the last is included in the model:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.006847   0.071687   0.096    0.924
PC1         -0.002607   0.050653  -0.051    0.959
PC2         -0.015818   0.072521  -0.218    0.828

Residual standard error: 1.014 on 197 degrees of freedom
Multiple R-squared:  0.0002549, Adjusted R-squared:  -0.009895 
F-statistic: 0.02511 on 2 and 197 DF,  p-value: 0.9752

This model is worthless: the adjusted $R^2$ is essentially zero, all p-values are large, and all estimated coefficients are indistinguishable from zero. As such, it reflects a dramatic failure of PCA regression.

It should be clear why the failure occurred: $y$ is very nearly a linear function of the smallest principal component. Indeed, when we include only the last principal component $z_3$ we obtain a great model:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.006847   0.006679   1.025    0.307    
PC3         14.302384   0.095341 150.013   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09445 on 198 degrees of freedom
Multiple R-squared:  0.9913,    Adjusted R-squared:  0.9912 
F-statistic: 2.25e+04 on 1 and 198 DF,  p-value: < 2.2e-16

Herein lies a key insight: when the full model fits the response $y$ well but principal components regression does not, that means the response is related to information present in the smaller principal components. This suggests several effective courses of action. For instance, you might fit a sequence of models in which you introduce each principal component in order (from largest to smallest) until obtaining one that comes sufficiently close to the full model. I will omit the details because there are many considerations and they are well discussed in other threads on variable selection and model building.
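
Here is a sketch of that forward sequence, using the objects from the code at the end of this answer (the 95% threshold is just an illustrative stopping rule):

R2.full <- summary(lm(y ~ ., as.data.frame(X)))$r.squared
for (m in 1:k) {
  R2.m <- summary(lm(y ~ ., as.data.frame(PCA$x[, 1:m, drop = FALSE])))$r.squared
  cat(sprintf("First %d component(s): R-squared = %.4f\n", m, R2.m))
  if (R2.m >= 0.95 * R2.full) break    # stop once we come close to the full model
}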


The R code to create such examples reveals their method of construction: simply create the response $y$ from the smallest principal components of the $x_i.$ It's that easy. Of course things don't work this way in the "real world." The point is that it's possible for your response in your problem to behave like this, so you should be aware of that, know how to detect it, and have methods to fix it.

n <- 200        # Sample size
k <- 3          # Independent variable count
sigma <- 0.1    # Error SD for `y`
tau <- 0.1      # Controls multicollinearity of `X`; 0 is collinear
# set.seed(17)  # Uncomment to reproduce the figures exactly
#
# Create multicollinear explanatory variables.
#
X <- matrix(rnorm(n*(k-1)), nrow=n)
X <- cbind(X[, 1] + rnorm(n, 0, tau), X)
colnames(X) <- paste("X", 1:k, sep=".")
round(cov(X), 2)               # Display their covariance matrix
#
# Conduct PCA and keep the largest PCs.
# Set `scale.=FALSE` to do PCA on the covariances.
#
PCA <- prcomp(X, scale.=TRUE)
plot(PCA)
summary(PCA)
k.keep <- sum(PCA$sdev > 0.5) # Some people retain only those exceeding 1
#
# Construct a response that is related primarily to a small PC.
#
y <- c(scale(PCA$x[, k])) + rnorm(n, 0, sigma)
#
# Compare the full regression to (a) PC regression and 
# (b) regression on the last PC only.
#
summary(lm(y ~ ., as.data.frame(X)))                             # Full regression
summary(lm(y ~ ., as.data.frame(PCA$x[, 1:k.keep, drop=FALSE]))) # PC regression
summary(lm(y ~ ., as.data.frame(PCA$x[, k, drop=FALSE])))        # Last-PC regression