PCA – Can Multiple Principal Components Be Correlated to the Same Independent Variable?

biostatistics, pca, regression

Assume I have a large dataset of variables for a population with known metadata, such as age, sex and ethnicity. I perform PCA on this dataset to reduce the number of variables. Then, regression analysis is used to model the correlation of the principal components with the metadata.

Is there a hypothetical situation where multiple principal components correlate with the same response variable, such as age? My first guess would be that this is impossible, as they would be captured in the same principal component.

Additionally, would it be useless to fit a linear regression model with multiple principal components for one response variable? If the answer to my first question is "No", then multiple principal components could not have more predictive value for a response variable than a single principal component.

Best Answer

In short: yes, this is entirely possible, and it is probably even the generic situation.

PCA components have the special feature that they are mutually uncorrelated. If we call your PCA components $x_a$, then $$cov(x_a, x_b) = 0 ~~~ \text{for} ~ a \neq b$$

But they can still be correlated with some other variable $y$, i.e. you can have $$cov(x_a, y) \neq 0 ~~~ \text{and} ~~~ cov(x_b, y) \neq 0 $$ at the same time.
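A small simulation can make this concrete. The sketch below is only an illustration under made-up assumptions: two synthetic variables `u` and `v` that both depend on a simulated `age`, with different noise levels. For two variables, PCA reduces to a rotation by a closed-form angle, so only the standard library is needed.

```python
import math
import random

def centered(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

def cov(a, b):
    # Population covariance (1/N convention, as in the text).
    a, b = centered(a), centered(b)
    return sum(s * t for s, t in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

random.seed(0)
N = 2000

# Hypothetical data: both measured variables depend on age,
# but with very different amounts of noise.
age = [random.gauss(0, 1) for _ in range(N)]
u = centered([a + random.gauss(0, 3.0) for a in age])
v = centered([a + random.gauss(0, 0.5) for a in age])

# PCA in two dimensions: rotate by the angle of the major axis
# of the sample covariance matrix.
theta = 0.5 * math.atan2(2 * cov(u, v), cov(u, u) - cov(v, v))
c, s = math.cos(theta), math.sin(theta)
pc1 = [c * ui + s * vi for ui, vi in zip(u, v)]
pc2 = [-s * ui + c * vi for ui, vi in zip(u, v)]

print("cov(pc1, pc2)  =", cov(pc1, pc2))   # zero up to rounding
print("corr(pc1, age) =", corr(pc1, age))  # clearly non-zero
print("corr(pc2, age) =", corr(pc2, age))  # clearly non-zero
```

The two components are uncorrelated with each other by construction, yet in this setup both carry information about `age`.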

One can prove this mathematically, but the geometric interpretation of the sample (co)variance makes it easy to see intuitively:

Say we label your datapoints by the index $i = 1 , \cdots , N$, then we can think of each principal component as a vector $ \vec{x_a} $ by simply stacking the observed values together $$\vec{x_a} = (x_{a,1}, x_{a,2}, \cdots, x_{a,N} )^T $$

Now look at the definition of the variance $$ var(x_a) = \mathbb{E}[x_a^2] = \frac{1}{N} \sum_{i=1}^N x_{a,i}^2 $$ where we used, w.l.o.g., that the principal components are centered, i.e. $\mathbb{E}[x_a] = 0$. You can see that the variance of $x_a$ is simply the squared Euclidean norm, i.e. the squared length, of the vector $\vec{x_a}$, up to the factor $\frac{1}{N}$: $$ var(x_a) = \frac{1}{N} ||\vec{x_a}||^2 $$

Similarly, looking at the covariance $$ cov(x_a,x_b) = \mathbb{E}[x_a x_b]= \frac{1}{N} \sum_{i=1}^N x_{a,i}x_{b,i} $$ we recognise it as the dot product of $\vec{x_a}$ with $\vec{x_b}$ $$ cov(x_a,x_b) = \frac{1}{N} \, \vec{x_a} \cdot \vec{x_b} = \frac{1}{N} \, ||\vec{x_a}|| \, ||\vec{x_b}|| \, \cos(\phi_{ab}) $$ where $\phi_{ab}$ is the angle between the two vectors, so uncorrelated components correspond to orthogonal vectors.

In the same way, for a centered $y$, $ cov(x_a,y) = \frac{1}{N} \, \vec{x_a} \cdot \vec{y}$.
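These identities are easy to check numerically. The following is a minimal sketch with made-up numbers (the vectors `x_a` and `y` are hypothetical, not taken from any dataset): it confirms that the covariance is the dot product divided by $N$, and that the correlation equals the cosine of the angle between the centered vectors.

```python
import math

def centered(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Hypothetical observations, centered so that the mean is zero.
x_a = centered([2.0, 4.0, 1.0, 7.0, 6.0])
y = centered([30.0, 41.0, 25.0, 58.0, 46.0])
N = len(x_a)

var_a = dot(x_a, x_a) / N    # variance = squared length / N
var_y = dot(y, y) / N
cov_ay = dot(x_a, y) / N     # covariance = dot product / N

# Correlation computed from (co)variances ...
corr_ay = cov_ay / math.sqrt(var_a * var_y)
# ... equals the cosine of the angle between the two vectors.
cos_phi = dot(x_a, y) / (norm(x_a) * norm(y))

print(cov_ay, corr_ay, cos_phi)
```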

Having this geometric interpretation of variance and covariance allows us to re-state the question in a way that makes the answer obvious:

Can you have two orthogonal vectors, $\vec{x_a} $ and $\vec{x_b} $, which each have overlap with some other vector $\vec{y} $?

The answer is clearly yes. A simple example is the 2-d plane: say $\vec{x_a} $ points along the horizontal axis and $\vec{x_b} $ points along the vertical axis, then

  • they are mutually orthogonal and
  • any other generic vector in the 2-d plane will have some non-zero overlap with both of them
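
The same picture works with centered data vectors. As a concrete illustration (the numbers are made up for this sketch and are not derived from any PCA), take $N = 4$ and $$ \vec{x_a} = (1, -1, 1, -1)^T, ~~~ \vec{x_b} = (1, 1, -1, -1)^T, ~~~ \vec{y} = (2, 0, 1, -3)^T $$ All three vectors sum to zero, and $$ cov(x_a, x_b) = \frac{1}{4}(1 - 1 - 1 + 1) = 0, ~~~ cov(x_a, y) = \frac{1}{4}(2 - 0 + 1 + 3) = \frac{3}{2}, ~~~ cov(x_b, y) = \frac{1}{4}(2 + 0 - 1 + 3) = 1 $$ so $\vec{x_a}$ and $\vec{x_b}$ are uncorrelated with each other while both have non-zero covariance with $\vec{y}$.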

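This also answers your second question: no, it is not useless to fit a model with multiple principal components and one response variable. Because the components are mutually uncorrelated, each one can explain a different part of the response :)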
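To illustrate the regression point, here is a hedged sketch on synthetic data (the variables `u`, `v` and the simulated `age` are assumptions made for this example). Since the components are uncorrelated, the least-squares coefficients decouple, and the $R^2$ of the joint fit is the sum of the single-component $R^2$ values, so adding a second component can only help.

```python
import math
import random

def centered(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

random.seed(1)
N = 2000

# Hypothetical data: a response (age) and two noisy variables driven by it.
raw_age = [random.gauss(0, 1) for _ in range(N)]
u = centered([a + random.gauss(0, 3.0) for a in raw_age])
v = centered([a + random.gauss(0, 0.5) for a in raw_age])
age = centered(raw_age)

# Two-dimensional PCA via the closed-form rotation angle.
theta = 0.5 * math.atan2(2 * dot(u, v), dot(u, u) - dot(v, v))
c, s = math.cos(theta), math.sin(theta)
pc1 = [c * ui + s * vi for ui, vi in zip(u, v)]
pc2 = [-s * ui + c * vi for ui, vi in zip(u, v)]

# With orthogonal, centered regressors the OLS coefficients decouple.
b1 = dot(pc1, age) / dot(pc1, pc1)
b2 = dot(pc2, age) / dot(pc2, pc2)

def r_squared(pred):
    ssr = sum((t - p) ** 2 for t, p in zip(age, pred))
    return 1.0 - ssr / dot(age, age)

r2_pc1 = r_squared([b1 * p for p in pc1])
r2_pc2 = r_squared([b2 * p for p in pc2])
r2_joint = r_squared([b1 * p + b2 * q for p, q in zip(pc1, pc2)])

print("R^2 with pc1 only:", r2_pc1)
print("R^2 with pc2 only:", r2_pc2)
print("R^2 with both    :", r2_joint)  # equals the sum of the two above
```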
