PCA – Can Multiple Principal Components Be Correlated to the Same Independent Variable?

biostatistics, pca, regression

Assume I have a large dataset of variables for a population with known metadata, such as age, sex and ethnicity. I perform PCA on this dataset to reduce the number of variables. Then I use regression analysis to model the relationship between the principal components and the metadata.

Is there a hypothetical situation where multiple principal components correlate with the same response variable, such as age? My first guess would be that this is impossible, as they would be captured in the same principal component.

Additionally, would it be useless to fit a linear regression model with multiple principal components for one response variable? If the answer to my first question is "No", then multiple principal components could not have more predictive value for a response variable than a single principal component.

Best Answer

In short: yes, this is entirely possible, and it is probably even the generic situation.

PCA components have the special feature that they are mutually uncorrelated. If we call your PCA components $x_a$, then $$cov(x_a, x_b) = 0 ~~~ \text{for} ~ a \neq b$$

But they can still be correlated with some other variable, i.e. you can have $$cov(x_a, y) \neq 0 ~~~ \text{and} ~~~ cov(x_b, y) \neq 0 $$ at the same time.

One can prove this mathematically, but the geometric interpretation of the sample (co)variance helps to see it intuitively:

Say we label your datapoints by the index $i = 1 , \cdots , N$, then we can think of each principal component as a vector $ \vec{x_a} $ by simply stacking the observed values together $$\vec{x_a} = (x_{a,1}, x_{a,2}, \cdots, x_{a,N} )^T $$

Now, looking at the definition of variance $$ var(x_a) = \mathbb{E}[x_a^2] = \frac{1}{N} \sum_{i=1}^N x_{a,i}^2 $$ where we used w.l.o.g. that the principal components are centered, i.e. $\mathbb{E}[x_a] = 0$. You can see that the variance of $x_a$ is simply the squared Euclidean norm, i.e. the squared length, of the vector $\vec{x_a}$, up to the factor $1/N$: $$ var(x_a) = \frac{1}{N} ||\vec{x_a}||^2 $$

Similarly, looking at the covariance $$ cov(x_a,x_b) = \mathbb{E}[x_a x_b]= \frac{1}{N} \sum_{i=1}^N x_{a,i}x_{b,i} $$ we recognise that as the dot product of $\vec{x_a}$ with $\vec{x_b}$ $$ cov(x_a,x_b) = \frac{1}{N} \, \vec{x_a} \cdot \vec{x_b} = \frac{1}{N} \, ||\vec{x_a}|| \, ||\vec{x_b}|| \, \cos(\phi_{ab}) $$ where $\phi_{ab}$ is the angle between the two vectors. Zero covariance therefore means that the two vectors are orthogonal.

In the same way, taking $y$ to be centered as well, $ cov(x_a,y) = \frac{1}{N} \, \vec{x_a} \cdot \vec{y}$.

Having this geometric interpretation of variance and covariance allows us to re-state the question in a way that makes the answer obvious:

Can you have two orthogonal vectors, $\vec{x_a} $ and $\vec{x_b} $, which each have overlap with some other vector $\vec{y} $?

The answer to that is clearly yes. A simple example is the 2-d plane: say $\vec{x_a} $ points along the horizontal axis and $\vec{x_b} $ points along the vertical axis. Then they are mutually orthogonal, and any other generic vector in the 2-d plane will have some non-zero overlap with both of them.
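As a quick numerical check of this picture, here is a minimal sketch in plain Python. The vectors are made-up illustrative values, not real PCA output: they are simply chosen to be centered and mutually orthogonal, and the weights 2 and 3 used to build `y` are arbitrary.

```python
def cov(u, v):
    """Sample covariance with 1/N normalisation; assumes u and v are already centered."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Hypothetical "component" vectors: centered and orthogonal to each other.
x_a = [1.0, -1.0, 1.0, -1.0]
x_b = [1.0, 1.0, -1.0, -1.0]

# A response that overlaps with both directions (weights are arbitrary).
y = [2.0 * a + 3.0 * b for a, b in zip(x_a, x_b)]

cov_ab = cov(x_a, x_b)  # covariance between the two components
cov_ay = cov(x_a, y)    # covariance of the first component with y
cov_by = cov(x_b, y)    # covariance of the second component with y

print(cov_ab, cov_ay, cov_by)  # prints 0.0 2.0 3.0
```

The two components are uncorrelated with each other, yet each has a non-zero covariance with `y`.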

This also answers your second question: no it's not useless to do a fit with multiple principal components and one response variable :)

Related Solutions

Solved – How to apply regression on principal components to predict an output variable

You don't choose a subset of your original 99 (100-1) variables.

Each of the principal components is a linear combination of all 99 predictor variables (x-variables, IVs, ...). If you use the first 40 principal components, each of them is a function of all 99 original predictor variables. (At least with ordinary PCA - there are sparse/regularized versions such as the SPCA of Zou, Hastie and Tibshirani that will yield components based on fewer variables.)

Consider the simple case of two positively correlated variables, which for simplicity we will assume are equally variable. Then the first principal component will be a (fractional) multiple of the sum of both variates and the second will be a (fractional) multiple of the difference of the two variates; if the two are not equally variable, the first principal component will weight the more-variable one more heavily, but it will still involve both.
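The two-variable case can be checked by hand. The sketch below uses a small made-up dataset (the values of `x1` and `x2` are assumptions for illustration only) in which both variables are centered, equally variable and positively correlated, and forms the scaled sum and difference directly rather than calling a PCA routine.

```python
import math

def cov(u, v):
    """Sample covariance with 1/N normalisation; assumes u and v are already centered."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Hypothetical data: centered, equal variance (2.5 each), covariance 2.0.
x1 = [2.0, 1.0, -1.0, -2.0]
x2 = [1.0, 2.0, -2.0, -1.0]

# Unit-length weights: (1, 1)/sqrt(2) for the sum, (1, -1)/sqrt(2) for the difference.
s = 1.0 / math.sqrt(2.0)
pc1 = [s * (a + b) for a, b in zip(x1, x2)]  # scaled sum of the two variates
pc2 = [s * (a - b) for a, b in zip(x1, x2)]  # scaled difference of the two variates

var_pc1 = cov(pc1, pc1)  # variance + covariance of the originals, about 4.5
var_pc2 = cov(pc2, pc2)  # variance - covariance of the originals, about 0.5
cov_pc = cov(pc1, pc2)   # the two components are uncorrelated

print(var_pc1, var_pc2, cov_pc)
```

Both components involve both original variables, the sum carries most of the variance, and the two components are uncorrelated with each other.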

So you start with your 99 x-variables, from which you compute your 40 principal components by applying the corresponding weights on each of the original variables. [NB in my discussion I assume $y$ and the $X$'s are already centered.]

You then use your 40 new variables as if they were predictors in their own right, just as you would with any multiple regression problem. (In practice, there are more efficient ways of getting the estimates, but let's leave the computational aspects aside and just deal with the basic idea.)

In respect of your second question, it's not clear what you mean by "reversing of the PCA".

Your PCs are linear combinations of the original variates. Let's say your original variates are in $X$, and you compute $Z=XW$ (where $X$ is $n\times 99$ and $W$ is the $99\times 40$ matrix which contains the principal component weights for the $40$ components you're using), then you estimate $\hat{y}=Z\hat{\beta}_\text{PC}$ via regression.

Then you can write $\hat{y}=Z\hat{\beta}_\text{PC}=XW\hat{\beta}_\text{PC}=X\hat{\beta}^*$ say (where $\hat{\beta}^*=W\hat{\beta}_\text{PC}$, obviously), so you can write it as a function of the original predictors; I don't know if that's what you meant by 'reversing', but it's a meaningful way to look at the original relationship between $y$ and $X$. It's not the same as the coefficients you get by estimating a regression on the original X's of course -- it's regularized by doing the PCA; even though you'd get coefficients for each of your original X's this way, they only have the d.f. of the number of components you fitted.
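To make the back-mapping $\hat{\beta}^*=W\hat{\beta}_\text{PC}$ concrete, here is a minimal sketch in plain Python with two predictors and a single retained component. The data, the response `y` and the variable names are all assumptions made for illustration; with one component, the regression on $Z$ reduces to a ratio of covariance to variance.

```python
import math

def cov(u, v):
    """Sample covariance with 1/N normalisation; assumes u and v are already centered."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Hypothetical centered predictors and response.
x1 = [2.0, 1.0, -1.0, -2.0]
x2 = [1.0, 2.0, -2.0, -1.0]
y = [3.0, 2.0, -2.0, -3.0]

# W: weights of the first principal component of (x1, x2), here (1, 1)/sqrt(2).
s = 1.0 / math.sqrt(2.0)
w = [s, s]

# Z = X W: scores on the retained component.
z = [w[0] * a + w[1] * b for a, b in zip(x1, x2)]

# Regress y on the single component score.
beta_pc = cov(z, y) / cov(z, z)

# Map back to the original predictors: beta_star = W beta_pc.
beta_star = [w_j * beta_pc for w_j in w]

# The fitted values agree whichever way they are computed.
y_hat_from_z = [beta_pc * zi for zi in z]
y_hat_from_x = [beta_star[0] * a + beta_star[1] * b for a, b in zip(x1, x2)]

print(beta_star)
print(y_hat_from_z)
print(y_hat_from_x)
```

With these particular numbers both entries of `beta_star` come out at about 5/6, whereas an unrestricted least-squares fit on `x1` and `x2` reproduces `y` exactly with coefficients 4/3 and 1/3. The difference is the regularization from keeping only one component: there are two coefficients, but only one degree of freedom behind them.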

Also see Wikipedia on principal component regression.

pca – Extrapolating Principal Components Factors with Other Variables

The method you selected from the page you cite is incorrect, or at least not standard, as the author of that answer explains below the code that you used. It applies the varimax rotation to the original eigenvectors from the PCA, which is not standard practice.

For this type of analysis, "Loadings are eigenvectors scaled by the square roots of the respective eigenvalues," as explained on that page in the answer from @amoeba, while your prc$rotation values are unscaled eigenvectors. Of the 3 correct methods shown in that answer, the one perhaps closest to your code (using the first 4 principal components) might be translated to:

rawLoadings     <- prc$rotation[,1:4] %*% diag(prc$sdev, 4, 4) # scaling
rotatedLoadings <- varimax(rawLoadings)$loadings # varimax rotation after scaling
invLoadings     <- t(pracma::pinv(rotatedLoadings)) # transpose of generalized inverse
scores          <- scale(df) %*% invLoadings

To avoid errors, you should consider using packages that have been vetted to provide correct results, like the R psych package. That's also illustrated in the answer from @amoeba.
