I want to apply PCA to a dataset that consists of mixed-type variables (continuous and binary). To illustrate the procedure, I paste a minimal reproducible example in R below.
# Generate synthetic dataset
set.seed(12345)
n <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
data <- data.frame(x1, x2, x3, x4, x5)
# Correlation matrix with appropriate coefficients
# Pearson product-moment: 2 continuous variables
# Point-biserial: 1 continuous and 1 binary variable
# Phi: 2 binary variables
# For testing purposes use hetcor function
library(polycor)
C <- as.matrix(hetcor(data=data))
# Run PCA
pca <- princomp(covmat=C)
L <- loadings(pca)
Now, I wonder how to calculate component scores (i.e., raw variables weighted by component loadings). When the dataset consists only of continuous variables, component scores are simply obtained by multiplying the (scaled) raw data by the eigenvectors stored in the loadings matrix (L in the example above). Any pointers would be greatly appreciated.
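For reference, the continuous-only case described above can be sketched as follows. This is only an illustration under stated assumptions: it uses just x1, x2 and x3 from the example, and the names X, Z, e and scores are introduced here for the sketch.

```r
# Continuous-only illustration: scores = standardized data %*% eigenvectors
set.seed(12345)
n  <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
X  <- cbind(x1, x2, x3)

Z <- scale(X)              # center and scale each column
e <- eigen(cor(X))         # eigen-decomposition of the correlation matrix
scores <- Z %*% e$vectors  # component scores, one column per component

# Sanity checks: the scores are mutually uncorrelated and
# their variances equal the eigenvalues of the correlation matrix
round(cov(scores), 10)
all.equal(diag(cov(scores)), e$values)
```

What I am unsure about is whether the same multiplication is legitimate once the binary variables x4 and x5 are included.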
Best Answer
I think Insanodag is right. I quote Jolliffe's Principal Component Analysis:
Multiplying the data matrix with the loadings matrix will give the desired result. However, I've had some problems with the princomp() function, so I used prcomp() instead. One of the return values of prcomp() is x, which is activated using retx=TRUE. This x is the multiplication of the data matrix by the loadings matrix, as stated in the R Documentation:
Let me know if this was useful, or if it needs further corrections.
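As a quick check of the statement about x, here is a minimal sketch on the question's mixed data. It rests on two assumptions: the 0/1 columns are left numeric, and cor(data) is used as a stand-in for the hetcor matrix (Pearson's r on 0/1 data coincides with the point-biserial and phi coefficients). The names manual and L_cor are introduced only for this illustration.

```r
# Rebuild the question's synthetic data
set.seed(12345)
n  <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
data <- data.frame(x1, x2, x3, x4, x5)

# scale. = TRUE makes prcomp() work on standardized variables,
# i.e. a PCA of the correlation matrix
pca <- prcomp(data, center = TRUE, scale. = TRUE, retx = TRUE)

# Scores computed by hand: standardized data times the loadings (rotation)
manual <- scale(data) %*% pca$rotation
all.equal(manual, pca$x, check.attributes = FALSE)

# The loadings agree with princomp(covmat = ...) up to the sign of each column
L_cor <- unclass(loadings(princomp(covmat = cor(data))))
all.equal(abs(L_cor), abs(pca$rotation), check.attributes = FALSE)
```

Because the sign of each component is arbitrary, scores built from the princomp() loadings may differ from pca$x by a factor of -1 in some columns.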
--
I.T. Jolliffe. Principal Component Analysis. Springer. Second Edition. 2002. pp. 339-343.