PCA and component scores based on a mix of continuous and binary variables

I want to apply PCA to a dataset that consists of mixed-type variables (continuous and binary). To illustrate the procedure, I have pasted a minimal reproducible example in R below.

# Generate synthetic dataset
set.seed(12345)
n <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
data <- data.frame(x1, x2, x3, x4, x5)

# Correlation matrix with appropriate coefficients
# Pearson product-moment: 2 continuous variables
# Point-biserial: 1 continuous and 1 binary variable
# Phi: 2 binary variables
# For testing purposes use hetcor function
library(polycor)
C <- as.matrix(hetcor(data=data))

# Run PCA
pca <- princomp(covmat=C)
L <- loadings(pca)

Now, I wonder how to calculate component scores (i.e., raw variables weighted by component loadings). When the dataset consists of continuous variables only, component scores are obtained simply by multiplying the (scaled) raw data by the eigenvectors stored in the loading matrix (L in the example above). Any pointers would be greatly appreciated.

Best Answer

I think Insanodag is right. I quote Jolliffe's Principal Component Analysis:

When PCA is used as a descriptive technique, there is no reason for the variables in the analysis to be of any particular type. [...] the basic objective of PCA - to summarize most of the 'variation' that is present in the original set of $p$ variables using a smaller number of derived variables - can be achieved regardless of the nature of the original variables.

Multiplying the data matrix by the loadings matrix will give the desired result. However, I've had some problems with the princomp() function, so I used prcomp() instead.

One of the return values of prcomp() is x, which is returned when retx=TRUE. This x is the data matrix multiplied by the loadings matrix, as stated in the R documentation:

rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function ‘princomp’ returns this in the element ‘loadings’.

x: if ‘retx’ is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the ‘rotation’ matrix) is returned. Hence, ‘cov(x)’ is the diagonal matrix ‘diag(sdev^2)’. For the formula method, ‘napredict()’ is applied to handle the treatment of values omitted by the ‘na.action’.
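To make the relationship between x and the loadings concrete, here is a minimal sketch that reuses the synthetic data from the question. It assumes the binary variables are coded as numeric 0/1, in which case Pearson's formula applied to them coincides with the point-biserial and phi coefficients, so cor(data) can stand in for the mixed correlation matrix. The names Z, scores_manual and scores_eig are introduced here for illustration only.

```r
# Synthetic data, as in the question
set.seed(12345)
n <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
data <- data.frame(x1, x2, x3, x4, x5)

# PCA on the correlation scale: centre and scale every column
pca <- prcomp(data, center = TRUE, scale. = TRUE, retx = TRUE)

# Component scores as returned by prcomp()
scores <- pca$x

# The same scores by hand: standardised data times the rotation (loadings) matrix
Z <- scale(data)
scores_manual <- Z %*% pca$rotation
print(all.equal(scores, scores_manual, check.attributes = FALSE))

# Starting from a correlation matrix instead (assumption: 0/1-coded binaries,
# so Pearson = point-biserial = phi); eigenvectors are defined only up to sign
e <- eigen(cor(data))
scores_eig <- Z %*% e$vectors
print(all.equal(abs(scores), abs(scores_eig), check.attributes = FALSE))
```

Because the sign of each eigenvector is arbitrary, the comparison with the eigen() route is made on absolute values; individual columns of scores may be flipped between the two approaches.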

Let me know if this was useful, or if it needs further corrections.

I.T. Jolliffe. Principal Component Analysis. Springer. Second Edition. 2002. pp 339-343.
