Solved – How to factor analyze two binary variables only

Tags: binary data, correspondence-analysis, factor analysis, pca, references

Normally when I do factor analysis, I have a whole bunch of variables that need to be reduced. But here I have only two binary variables (yes/no) that I need to reduce to a single interval-scaled factor. Is principal components analysis or factor analysis appropriate for this? When I run it, my extraction communalities are really high. I may also need a reference to justify this to reviewers.

Best Answer

Three is normally considered the minimum number of variables for factor analysis; this is maintained, among other places, in the Wikipedia article (which has a reference) and in some (most? all?) statistical software.
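One way to see where the three-variable minimum comes from is to count degrees of freedom. The sketch below uses the standard formula for an exploratory factor model with p variables and k factors, ((p - k)^2 - (p + k)) / 2; the helper name fa_df is made up for illustration and is not part of any package.

```r
# Degrees of freedom of an exploratory factor model with
# p observed variables and k common factors.
# fa_df is a hypothetical helper name used only for this sketch.
fa_df <- function(p, k = 1) ((p - k)^2 - (p + k)) / 2

fa_df(2)  # -1: more free parameters than observed (co)variances
fa_df(3)  #  0: just identified
fa_df(4)  #  2: over-identified, so the fit can be tested
```

With two variables and one factor the count is negative, so the loadings and uniquenesses cannot be pinned down from the data without extra constraints.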

There is no reason, however, that you can't do principal components analysis (which is closely related to factor analysis but not the same thing) to identify the principal component that explains most of the variance, even if you have only two binary variables. The correlation between the two can still be calculated.
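For two standardized variables the outcome can in fact be written down directly: a 2x2 correlation matrix with off-diagonal r has eigenvalues 1 + r and 1 - r, and the first component weights both variables equally. A minimal sketch, where r = 0.8 is an assumed illustrative value rather than anything estimated from real data:

```r
# Assumed illustrative correlation between the two variables
r <- 0.8
R <- matrix(c(1, r,
              r, 1), nrow = 2)

e <- eigen(R)
e$values                  # eigenvalues: 1 + r and 1 - r
e$values / sum(e$values)  # variance shares: (1 + r)/2 and (1 - r)/2
abs(e$vectors[, 1])       # first-component weights: both 1/sqrt(2)
```

So with only two variables, the proportion of variance explained by the first component is just a restatement of the correlation.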

See, for example, the code below, where Bin1 and Bin3 are correlated binary variables. The first principal component explains most of the variance and, because the two variables have nearly equal variances, is weighted almost equally on both of the original variables.

> eg <- data.frame(
+ Bin1 =sample(c(0,1),1000, replace=TRUE),
+ Bin2 =sample(c(0,1),1000, replace=TRUE))
> 
> eg$Bin3 <- ifelse(runif(1000)>.2, eg$Bin1, eg$Bin2)
> cor(eg)
            Bin1        Bin2      Bin3
Bin1  1.00000000 -0.05206971 0.8081088
Bin2 -0.05206971  1.00000000 0.1404252
Bin3  0.80810881  0.14042523 1.0000000


> mod <- princomp(eg[,c("Bin1", "Bin3")])
> ld <- mod$loadings
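> # Jitter the points so that overlapping 0/1 observations are visible
> with(eg, plot(jitter(Bin1), jitter(Bin3), bty="l",
+ main="Jittered version of binary data,\nwith first principal component shown"))
> grid()
> # First principal component: a line through the centroid with slope ld[2,1]/ld[1,1]
> with(eg, lines(Bin1, ld[2,1]/ld[1,1] * (Bin1-mean(Bin1)) + mean(Bin3), col="red"))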

[Figure: scatterplot of the two correlated binary variables (points are jittered).]

> summary(mod)
Importance of components:
                          Comp.1     Comp.2
Standard deviation     0.6722961 0.21901598
Proportion of Variance 0.9040544 0.09594559
Cumulative Proportion  0.9040544 1.00000000
> par(mfrow=c(1,2))
> plot(mod)
> biplot(mod)

[Figure: component eigenvalues (left) and biplot of loadings and scores (right).]
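It may also help to look at the component scores themselves before treating the result as a single interval-scaled factor. The rough sketch below regenerates data in the same way as the example above and lists the first-component score for each 0/1 pattern; the seed and the names eg2, mod2, pc1 and patterns are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 1000
Bin1 <- sample(c(0, 1), n, replace = TRUE)
Bin2 <- sample(c(0, 1), n, replace = TRUE)
Bin3 <- ifelse(runif(n) > 0.2, Bin1, Bin2)  # Bin3 mostly copies Bin1
eg2 <- data.frame(Bin1, Bin3)               # hypothetical name

mod2 <- princomp(eg2)
pc1 <- mod2$scores[, 1]  # first-component score for every row

# One row per distinct 0/1 pattern with its score: at most four rows
patterns <- unique(cbind(eg2, PC1 = round(pc1, 3)))
patterns

# The score is close to a rescaled version of the simple sum
# (abs() because the sign of a component is arbitrary)
abs(cor(pc1, eg2$Bin1 + eg2$Bin3))
```

Because two binary variables allow only four response patterns, the score can take at most four distinct values, and it behaves much like the count of "yes" answers.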
