Solved – How to use principal components analysis to select variables for regression

model selection, pca, regression

I am currently using principal components analysis to select variables to use in modelling. At the moment, I make measurements A, B and C in my experiments. What I really want to know is: can I make fewer measurements, and stop recording C and/or B, to save time and effort?

I find that all 3 variables load heavily onto my first principal component, which accounts for 60% of the variance in my data. The loadings tell me that if I add these variables together in a certain ratio (aA + bB + cC), I get a score on PC1 for each case in my dataset. I could use this score as a variable in modelling, but that doesn't allow me to stop measuring B and C.

If I square the loadings of A, B and C on PC1, I find that variable A accounts for 65% of the variance in PC1, variable B accounts for 50%, and variable C also 50%; i.e. some of the variance in PC1 accounted for by each of A, B and C is shared with another variable, but A comes out on top, accounting for slightly more.
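For concreteness, this is roughly how I compute the PC1 score and the squared loadings (a minimal sketch in Python/NumPy; the data below are simulated stand-ins, not my real measurements):

```python
import numpy as np

# Simulated stand-in for the real A, B, C measurements: three noisy views of one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent + 0.6 * rng.normal(size=(200, 1)) for _ in range(3)])

# Standardise, then do PCA via the eigendecomposition of the correlation matrix.
Xs = (data - data.mean(axis=0)) / data.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

weights = eigvecs[:, 0]                    # the (a, b, c) defining the PC1 combination aA + bB + cC
pc1_scores = Xs @ weights                  # one PC1 score per case, usable as a derived variable
loadings = weights * np.sqrt(eigvals[0])   # correlation of each standardised variable with PC1

print("share of total variance on PC1:", eigvals[0] / eigvals.sum())
print("squared loadings:", loadings ** 2)  # variance of each variable carried by PC1; these sum to
                                           # the PC1 eigenvalue, not to 1, hence the overlap above
```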

Is it wrong to think that I could just choose variable A (or possibly aA + bB, if necessary) to use in modelling, because this variable describes a large proportion of the variance in PC1, which in turn describes a large proportion of the variance in the data?

Which approach have you gone for in the past?

  • Single variable which loads heaviest on PC1 even if there are other heavy loaders?
  • Component score on PC1 using all variables even if they are all heavy loaders?

Best Answer

You haven't specified what "modeling" you plan on, but it sounds like you're asking about how to select independent variables among $A$, $B$, and $C$ for the purpose of (say) regressing a fourth dependent variable $W$ on them.

To see that this approach can go wrong, consider three independent, Normally distributed variables $X$, $Y$, and $Z$ with unit variance. For the true, underlying model, choose a small constant $\beta \ll 1$ and an even smaller constant $\epsilon \ll \beta$, and let the dependent variable be $W = Z$ plus a little error that is independent of $X$, $Y$, and $Z$.

Suppose the independent variables you have are $A = X + \epsilon Y$, $B = X - \epsilon Y$, and $C = \beta Z$. Then $W$ and $C$ are strongly correlated (how strongly depends on the variance of the error), because each is close to a multiple of $Z$. However, $W$ is uncorrelated with both $A$ and $B$. Because $\beta$ is small, the first principal component of $\{A, B, C\}$ is parallel to $X$, with eigenvalue close to $2$, which dwarfs $\beta^2$, the variance of $C$. $A$ and $B$ load heavily on this component and $C$ loads not at all, because it is independent of $X$ (and $Y$). Nevertheless, if you eliminate $C$ from the independent variables, leaving only $A$ and $B$, you will be throwing away all the information about the dependent variable, because $W$, $A$, and $B$ are independent!
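A quick simulation (a Python sketch; the particular values of $\beta$, $\epsilon$, the sample size and the error variance are arbitrary choices made here for illustration) shows the effect: PC1 is essentially $X$, $C$ barely loads on it, yet only $C$ is correlated with $W$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, eps = 10_000, 0.1, 0.001

X, Y, Z = rng.normal(size=(3, n))
W = Z + 0.1 * rng.normal(size=n)         # dependent variable: essentially Z plus a little error

A = X + eps * Y
B = X - eps * Y
C = beta * Z
preds = np.column_stack([A, B, C])

# PCA of the (unstandardised) predictors.
eigvals, eigvecs = np.linalg.eigh(np.cov(preds, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]
print("PC1 weights on (A, B, C):", np.round(pc1, 3))   # ~ (0.707, 0.707, 0), up to sign
print("largest eigenvalue:", round(eigvals.max(), 3))  # ~ 2, dwarfing Var(C) = beta**2

# ...but the dependent variable only "sees" C.
for name, v in (("A", A), ("B", B), ("C", C)):
    print(f"corr(W, {name}) = {np.corrcoef(W, v)[0, 1]:+.3f}")   # ~0, ~0, ~ +0.995
```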

This example shows that for regression you need to pay attention to how the independent variables are correlated with the dependent one; you cannot get away with analyzing only the relationships among the independent variables.
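One way to act on that (a sketch only, not something this answer prescribes; it reuses `preds` and `W` from the simulation above and assumes scikit-learn is available) is to compare candidate predictor subsets by how well they actually predict the dependent variable, e.g. by cross-validated $R^2$, rather than by their PC1 loadings:

```python
from itertools import combinations

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

names = ["A", "B", "C"]
for k in (1, 2, 3):
    for cols in combinations(range(3), k):
        r2 = cross_val_score(LinearRegression(), preds[:, list(cols)], W,
                             cv=5, scoring="r2").mean()
        subset = "{" + ", ".join(names[i] for i in cols) + "}"
        print(f"{subset:12s} mean CV R^2 = {r2:6.3f}")
# Subsets containing C score ~0.99; {A}, {B} and {A, B} score ~0 (or slightly negative),
# despite A and B dominating PC1.
```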
