Solved – How to use principal components as predictors in regression

factor analysisfactor-rotationlogisticpcaregression

I have a couple of questions involving doing a regression (logistic or linear) after principal component analysis.

If I find principal components using Principal component analysis, can I use these components like regular variables to be used in linear and logistic regression?
If not, do I have to perform an extra step like "rotate" the principal components? What is this "rotation", what is it actually doing, and why is it helpful for using in regression? (this is what happens in factor analysis, no?)
If I can use these components, rotated or not, can I simply setup the model and bind it with the outcome data? Do I need to transform the outcome data in any way? Or can I use the original outcome data with the principal components?

Best Answer

Yes.

That's exactly what principal component regression is: https://en.wikipedia.org/wiki/Principal_component_regression.
No need to rotate.

In fact, rotating would not make any difference as far as prediction is concerned; see Using varimax-rotated PCA components as predictors in linear regression. Of course if you want to interpret individual regression coefficients, then you'd need to have interpretable predictors; rotated PCs might be more interpretable (or not).
No need to transform.

The dependent ("outcome") variable should be left alone. Including it into PCA would be cheating. There is no transformation that you should use only because you use PCA.

In general, you might want to read How can top principal components retain the predictive power on a dependent variable (or even lead to better predictions)?.

Related Solutions

Principal Components – How to Compute Varimax-Rotated Principal Components in R?

"Rotations" is an approach developed in factor analysis; there rotations (such as e.g. varimax) are applied to loadings, not to eigenvectors of the covariance matrix. Loadings are eigenvectors scaled by the square roots of the respective eigenvalues. After the varimax rotation, the loading vectors are not orthogonal anymore (even though the rotation is called "orthogonal"), so one cannot simply compute orthogonal projections of the data onto the rotated loading directions.

@FTusell's answer assumes that varimax rotation is applied to the eigenvectors (not to loadings). This would be pretty unconventional. Please see my detailed account of PCA+varimax for details: Is PCA followed by a rotation (such as varimax) still PCA? Briefly, if we look at the SVD of the data matrix $X=USV^\top$, then to rotate the loadings means inserting $RR^\top$ for some rotation matrix $R$ as follows: $X=(UR)(R^\top SV^\top).$

If rotation is applied to loadings (as it usually is), then there are at least three easy ways to compute varimax-rotated PCs in R :

They are readily available via function psych::principal (demonstrating that this is indeed the standard approach). Note that it returns standardized scores, i.e. all PCs have unit variance.
One can manually use varimax function to rotate the loadings, and then use the new rotated loadings to obtain the scores; one needs to multiple the data with the transposed pseudo-inverse of the rotated loadings (see formulas in this answer by @ttnphns). This will also yield standardized scores.
One can use varimax function to rotate the loadings, and then use the $rotmat rotation matrix to rotate the standardized scores obtained with prcomp.

All three methods yield the same result:

irisX <- iris[,1:4]      # Iris data
ncomp <- 2

pca_iris_rotated <- psych::principal(irisX, rotate="varimax", nfactors=ncomp, scores=TRUE)
print(pca_iris_rotated$scores[1:5,])  # Scores returned by principal()

pca_iris        <- prcomp(irisX, center=T, scale=T)
rawLoadings     <- pca_iris$rotation[,1:ncomp] %*% diag(pca_iris$sdev, ncomp, ncomp)
rotatedLoadings <- varimax(rawLoadings)$loadings
invLoadings     <- t(pracma::pinv(rotatedLoadings))
scores          <- scale(irisX) %*% invLoadings
print(scores[1:5,])                   # Scores computed via rotated loadings

scores <- scale(pca_iris$x[,1:2]) %*% varimax(rawLoadings)$rotmat
print(scores[1:5,])                   # Scores computed via rotating the scores

This yields three identical outputs:

1 -1.083475  0.9067262
2 -1.377536 -0.2648876
3 -1.419832  0.1165198
4 -1.471607 -0.1474634
5 -1.095296  1.0949536

Note: The varimax function in R uses normalize = TRUE, eps = 1e-5 parameters by default (see documentation). One might want to change these parameters (decrease the eps tolerance and take care of Kaiser normalization) when comparing the results to other software such as SPSS. I thank @GottfriedHelms for bringing this to my attention. [Note: these parameters work when passed to the varimax function, but do not work when passed to the psych::principal function. This appears to be a bug that will be fixed.]

Principal Component Analysis – Different Results in SPSS and Stata After Rotation

You are correct. Stata is weird about this. Stata gives different results from SAS, R and SPSS, and it is difficult (in my opinion) to understand why without delving quite deep into the world of factor analysis and PCA.

Here's how you know that something weird is happening. The sum of the squared loadings for a component are equal to the eigenvalue for that component.

Pre-and post-rotation, the eigenvalues change, but the total eigenvalues don't change. Add up the sum of the squared loadings from your output (this is why I asked you to remove the blanks in my comment). With Stata's default, the sum of squared loadings will sum to 1.00 (within rounding error). With SPSS (and R, and SAS, and every other factor analysis program I've looked at) they will sum to the eigenvalue for that factor. (Post rotation eigenvalues change, but the sum of eigenvalues stays the same). The sum of squared loadings in SPSS is equal to the sum of the eigenvalues (i.e. 3.8723 + 1.40682), both pre- and post-rotation.

In Stata, the sum of the squared loadings for each factor is equal to 1.00, and so Stata has rescaled the loadings.

The only mention of this (that I have found) in the Stata documentation is in the estat loadings section of the help, where it says:

cnorm(unit | eigen | inveigen), an option used with estat loadings, selects the normalization of the eigenvectors, the columns of the principal-component loading matrix. The following normalizations are available

However, this appears to apply only to the unrotated component matrix, not the component rotated matrix. I can't get the unnormalized rotated matrix after PCA.

The people at Stata seem to know what they are doing, and usually have a good reason for doing things the way that they do. This one is beyond me though.

(For future reference, it would have made my life easier if you'd used a dataset that I could access, and if you'd included all output, without blanks).

Edit: My usual go-to site for information about how to get the same results for different programs is the UCLA IDRE. They don't cover PCA in Stata: http://www.ats.ucla.edu/stat/AnnotatedOutput/ I have to wonder if that's because they couldn't get the same result. :)

Best Answer

Related Solutions

Principal Components – How to Compute Varimax-Rotated Principal Components in R?

Principal Component Analysis – Different Results in SPSS and Stata After Rotation

Related Question