PCA and component scores based on a mix of continuous, binary and categorical variables

My question is closely related to an earlier one, "PCA and component scores based on a mix of continuous and binary variables".
I will use essentially the same code, but add a new nominal feature (x6) to the data set.

I want to apply PCA to a data set consisting of continuous, binary and categorical variables.

# Generate synthetic dataset
set.seed(12345)
n <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
x6 <- c(rep('A', 25), rep('B', 25), rep('C', 25), rep('D', 25))
data <- data.frame(x1, x2, x3, x4, x5, x6)

# Correlation matrix with appropriate coefficients
# Pearson product-moment: 2 continuous variables
# Point-biserial: 1 continuous and 1 binary variable
# Phi: 2 binary variables
# For testing purposes use hetcor function
library(polycor)
C <- as.matrix(hetcor(data=data))

# Run PCA
pca <- princomp(covmat=C)
L <- loadings(pca)

Now, to calculate the component scores, it was suggested to multiply the data set by the loadings L. This works for the numerical and binary variables, but not for the categorical one: the following computation turns the categorical feature into a vector of NAs.

scores <- data * L

How can I obtain the scores for this feature? Do I have to split it up into dummy variables to make this work?

Best Answer

As Deathkill14 and ttnphns pointed out, it is possible to split the categorical data into binary dummy variables. A solution could look like this (using this neat code snippet):

# Generate synthetic dataset
set.seed(12345)
n <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
x6 <- c(rep('A', 25), rep('B', 25), rep('C', 25), rep('D', 25))

data <- data.frame(x1, x2, x3, x4, x5)

# Create dummy variables from x6
for(level in unique(x6)){
    data[paste("dummy", level, sep = "_")] <- ifelse(x6 == level, 1, 0)
}

# Correlation matrix with appropriate coefficients
# Pearson product-moment: 2 continuous variables
# Point-biserial: 1 continuous and 1 binary variable
# Phi: 2 binary variables
# For testing purposes use hetcor function
library(polycor)
C <- as.matrix(hetcor(data=data))

# Run PCA
pca <- princomp(covmat=C)
L <- loadings(pca)

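# Calculate scores: the PCA was run on a correlation matrix, so standardize
# the columns first, then use matrix multiplication (%*%), not elementwise *
scores <- scale(as.matrix(data)) %*% unclass(L)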
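
As a cross-check on the score computation, here is a minimal sketch under a few assumptions that are not part of the answer above. Base R's cor() stands in for polycor::hetcor(); on 0/1 columns the Pearson coefficient coincides with the point-biserial and phi coefficients. model.matrix() replaces the loop, and one reference level ("A") is dropped so the dummy columns are not linearly dependent. The names X, Z, dummies and score_var are illustrative only. Since the PCA is run on a correlation matrix, the scores are the standardized data multiplied by the loadings matrix, and the variance of each score column should equal the corresponding eigenvalue.

```r
# Sketch only: base-R cor() is used here in place of polycor::hetcor()
set.seed(12345)
n  <- 100
x1 <- rnorm(n)
x2 <- runif(n, -2, 2)
x3 <- x1 + x2 + rnorm(n)
x4 <- rbinom(n, 1, 0.5)
x5 <- rbinom(n, 1, 0.6)
x6 <- factor(rep(c("A", "B", "C", "D"), each = 25))

# model.matrix() expands the factor into 0/1 columns; with an intercept,
# the first level ("A") is the reference and gets no column of its own.
dummies <- model.matrix(~ x6)[, -1]   # drop the intercept column
X <- cbind(x1, x2, x3, x4, x5, dummies)

# PCA on the correlation matrix
C   <- cor(X)
pca <- princomp(covmat = C)
L   <- unclass(loadings(pca))         # plain numeric matrix of loadings

# Scores: standardized data times loadings (matrix product)
Z      <- scale(X)
scores <- Z %*% L

# Each component's score variance should match its eigenvalue
score_var <- apply(scores, 2, var)
print(all.equal(unname(score_var), unname(pca$sdev^2)))
```

Keeping all four dummy columns, as in the loop above, also works with princomp(covmat = ...), but the columns then sum to 1 in every row, so the correlation matrix is singular and the last component has (numerically) zero variance.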