Solved – Principal component regression analysis using SPSS

Tags: factor analysis, pca, regression, spss

I have done a multiple linear regression (MLR) analysis of my data and found $R^2$ and $r$, and then, to remove the multicollinearity problem, I used PCA. This analysis generated as many principal components as I have variables, and I kept only the factors with an eigenvalue greater than 1 after rotation (varimax rotation). Now I am confused: when I use the factor scores of only the selected components as independent variables in the MLR, I obtain $r$ and $R^2$ lower than those from the original MLR, but I expected them to be higher. Where am I wrong?

Best Answer

If you include all of the components, not just the ones with an eigenvalue above 1, you will get the same $R^2$. The extracted principal components are simply linear combinations of the original variables, so the full set carries the same information; the individual coefficients on these transformed variables, however, need not bear any obvious relation to the coefficients from the regression on the original variables.
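To see why, note that the full set of component scores is just an invertible linear transformation of the (standardized) predictors, and least squares fitted values are unchanged by such a transformation. As a sketch in generic regression notation (not SPSS output), write the component scores as $Z = XW$ for an invertible weight matrix $W$; then

$$\hat{Y} = Z(Z^\top Z)^{-1}Z^\top Y = XW(W^\top X^\top XW)^{-1}W^\top X^\top Y = X(X^\top X)^{-1}X^\top Y,$$

so the fitted values, and hence $R^2$, are identical. Dropping components deletes columns of $Z$, which can only leave $R^2$ the same or lower.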

Here is an example in SPSS. Let's make some fake data with 5 X variables; four of them are highly correlated with one another, and one is uncorrelated with the rest.

*Making some fake data.
SET SEED 5.
INPUT PROGRAM.
LOOP Id = 1 TO 1000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

*Make a set of variables that are highly correlated, except for X5.
COMPUTE #latent = RV.NORMAL(0,1).
VECTOR X(5).
COMPUTE #size = 0.9.
LOOP #i = 1 TO 5.
  DO IF #i <= 4.
    COMPUTE X(#i) = #size*#latent + RV.NORMAL(0,SQRT(1-#size**2)).
  ELSE.
    COMPUTE X(#i) = RV.NORMAL(0,1).
  END IF.
END LOOP.
COMPUTE Y = 5 + 0.2*X1 - 0.3*X2 + 0.4*X3 + 0.1*X4 - 0.2*X5 + RV.NORMAL(0,1).
FREQ VAR X1 TO Y /FORMAT = NOTABLE /STATISTICS = MEAN VARIANCE.
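As an optional quick check of the simulated correlation structure, the correlations among X1 to X4 should come out around 0.8 (0.9 squared, given the #size value in the simulation), with X5 essentially uncorrelated with the rest.

*Optional check of the simulated correlation structure.
CORRELATIONS VARIABLES = X1 X2 X3 X4 X5.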

So here X1 to X4 share an underlying latent variable, but each one uniquely affects Y in a different way. When we regress Y on all of the X's, the estimates come out close to the population parameters - even with many of the X's highly correlated. The $R^2$ here is only 0.168, but this is the correct specification given my simulation.

REGRESSION VARIABLES = Y,X1,X2,X3,X4,X5
  /DEPENDENT = Y
  /METHOD = ENTER X1 X2 X3 X4 X5.

Now, if we ran the FACTOR command with its defaults, we would extract only one component using the eigenvalue-above-1 rule. This is because I did not generate X5 to be exactly orthogonal to the other X's; its eigenvalue ends up very close to 1 though, and if you looked at the scree plot you would probably keep it. Here I force SPSS to extract all 5 principal components (pet peeve - it makes no sense to talk about rotation when we are extracting principal components!).

FACTOR VARIABLES = X1 X2 X3 X4 X5
  /CRITERIA = FACTORS(5)
  /PLOT EIGEN
  /SAVE REG (5,PComp).
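The /SAVE subcommand appends the five component scores, PComp1 to PComp5, to the active dataset. As an optional sanity check before using them in the regression below, you can confirm that the saved scores are mutually uncorrelated (the off-diagonal correlations should be essentially zero).

*Optional: the saved component scores should be uncorrelated with one another.
CORRELATIONS VARIABLES = PComp1 PComp2 PComp3 PComp4 PComp5.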

Now, if we include all 5 of these orthogonal principal components, we get the same $R^2$ as the original equation, 0.168, although the coefficients no longer map onto the parameters I used to simulate the data. The $R^2$ will never be higher than what was found using the original measurements.

REGRESSION VARIABLES = Y,PComp1,PComp2,PComp3,PComp4,PComp5
  /DEPENDENT = Y
  /METHOD = ENTER PComp1 PComp2 PComp3 PComp4 PComp5.

If we mindlessly use the eigenvalue-over-1 criterion and keep only the first component, the $R^2$ drops to 0.08.

REGRESSION VARIABLES = Y,PComp1,PComp2,PComp3,PComp4,PComp5
  /DEPENDENT = Y
  /METHOD = ENTER PComp1.

This, unfortunately, is a very common procedure in the social sciences, where it isn't necessarily warranted. When you do this, you are essentially making a case for a congeneric measurement model in which the underlying latent variable is what affects Y, and you measure that latent variable with the principal component scores. If the original variables can affect Y in unique ways, reducing them to their principal component scores is inappropriate. Many times people do it mindlessly just because a few correlations are high - but if you look at the original regression here, the standard errors are small enough that multicollinearity shouldn't be any concern at all.
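If you want to check that claim directly rather than eyeballing the standard errors, one way is to request tolerance/VIF and collinearity diagnostics for the original model via the TOL and COLLIN keywords on REGRESSION's /STATISTICS subcommand, along the lines of the sketch below.

*Collinearity diagnostics (tolerance/VIF, condition indices) for the original model.
REGRESSION VARIABLES = Y,X1,X2,X3,X4,X5
  /STATISTICS = COEFF R ANOVA TOL COLLIN
  /DEPENDENT = Y
  /METHOD = ENTER X1 X2 X3 X4 X5.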