How to Conduct Factor Analysis on Binary Data in R – Step-by-Step Guide

binary datafactor analysispsychometricsr

I have some dichotomous data, only binary variables, and my boss asked me to perform a factor analysis using the tetrachoric correlations matrix. I’ve previously been able to teach myself how to run different analyses based on the examples here and at the UCLA’s stat site and other sites like it, but I can’t seem to find a step through an example of a factor analysis on dichotomous data (binary variables) using R.

I did see chl's response to a somewhat simular question and I also saw ttnphns' answer, but I am looking for something even more spelled out, a step through an example I can work with.

Does anyone here know of such a step through an example of a factor analysis on binary variables using R?

Update 2012-07-11 22:03:35Z

I should also add that I am working with an established instrument, that have three dimension, to which we have added some additional questions and we now hope to find four distinct dimension.
Furthermore, our sample size is only $n=153$, and we currently have $19$ items. I compared our sample size and our number of items to a number of psychology articles and we are definitely in the lower end, but we wanted to try it anyway.
Though, this is not important for the step through example I am looking for and caracal’s example below looks really amazing. I will work my way thru it using my data first thing in the morning.

Best Answer

I take it the focus of the question is less on the theoretical side, and more on the practical side, i.e., how to implement a factor analysis of dichotomous data in R.

First, let's simulate 200 observations from 6 variables, coming from 2 orthogonal factors. I'll take a couple of intermediate steps and start with multivariate normal continuous data that I later dichotomize. That way, we can compare Pearson correlations with polychoric correlations, and compare factor loadings from continuous data with that from dichotomous data and the true loadings.

set.seed(1.234)
N <- 200                             # number of observations
P <- 6                               # number of variables
Q <- 2                               # number of factors

# true P x Q loading matrix -> variable-factor correlations
Lambda <- matrix(c(0.7,-0.4, 0.8,0, -0.2,0.9, -0.3,0.4, 0.3,0.7, -0.8,0.1),
                 nrow=P, ncol=Q, byrow=TRUE)

Now simulate the actual data from the model $x = \Lambda f + e$, with $x$ being the observed variable values of a person, $\Lambda$ the true loadings matrix, $f$ the latent factor score, and $e$ iid, mean 0, normal errors.

library(mvtnorm)                      # for rmvnorm()
FF  <- rmvnorm(N, mean=c(5, 15), sigma=diag(Q))    # factor scores (uncorrelated factors)
E   <- rmvnorm(N, rep(0, P), diag(P)) # matrix with iid, mean 0, normal errors
X   <- FF %*% t(Lambda) + E           # matrix with variable values
Xdf <- data.frame(X)                  # data also as a data frame

Do the factor analysis for the continuous data. The estimated loadings are similar to the true ones when ignoring the irrelevant sign.

> library(psych) # for fa(), fa.poly(), factor.plot(), fa.diagram(), fa.parallel.poly, vss()
> fa(X, nfactors=2, rotate="varimax")$loadings     # factor analysis continuous data
Loadings:
     MR2    MR1   
[1,] -0.602 -0.125
[2,] -0.450  0.102
[3,]  0.341  0.386
[4,]  0.443  0.251
[5,] -0.156  0.985
[6,]  0.590       

Now let's dichotomize the data. We'll keep the data in two formats: as a data frame with ordered factors, and as a numeric matrix. hetcor() from package polycor gives us the polychoric correlation matrix we'll later use for the FA.

# dichotomize variables into a list of ordered factors
Xdi    <- lapply(Xdf, function(x) cut(x, breaks=c(-Inf, median(x), Inf), ordered=TRUE))
Xdidf  <- do.call("data.frame", Xdi) # combine list into a data frame
XdiNum <- data.matrix(Xdidf)         # dichotomized data as a numeric matrix

library(polycor)                     # for hetcor()
pc <- hetcor(Xdidf, ML=TRUE)         # polychoric corr matrix -> component correlations

Now use the polychoric correlation matrix to do a regular FA. Note that the estimated loadings are fairly similar to the ones from the continuous data.

> faPC <- fa(r=pc$correlations, nfactors=2, n.obs=N, rotate="varimax")
> faPC$loadings
Loadings:
   MR2    MR1   
X1 -0.706 -0.150
X2 -0.278  0.167
X3  0.482  0.182
X4  0.598  0.226
X5  0.143  0.987
X6  0.571       

You can skip the step of calculating the polychoric correlation matrix yourself, and directly use fa.poly() from package psych, which does the same thing in the end. This function accepts the raw dichotomous data as a numeric matrix.

faPCdirect <- fa.poly(XdiNum, nfactors=2, rotate="varimax")    # polychoric FA
faPCdirect$fa$loadings        # loadings are the same as above ...

EDIT: For factor scores, look at package ltm which has a factor.scores() function specifically for polytomous outcome data. An example is provided on this page -> "Factor Scores - Ability Estimates".

You can visualize the loadings from the factor analysis using factor.plot() and fa.diagram(), both from package psych. For some reason, factor.plot() accepts only the $fa component of the result from fa.poly(), not the full object.

factor.plot(faPCdirect$fa, cut=0.5)
fa.diagram(faPCdirect)

output from factor.plot() and fa.diagram()

Parallel analysis and a "very simple structure" analysis provide help in selecting the number of factors. Again, package psych has the required functions. vss() takes the polychoric correlation matrix as an argument.

fa.parallel.poly(XdiNum)      # parallel analysis for dichotomous data
vss(pc$correlations, n.obs=N, rotate="varimax")   # very simple structure

Parallel analysis for polychoric FA is also provided by the package random.polychor.pa.

library(random.polychor.pa)    # for random.polychor.pa()
random.polychor.pa(data.matrix=XdiNum, nrep=5, q.eigen=0.99)

output from fa.parallel.poly() and random.polychor.pa()

Note that the functions fa() and fa.poly() provide many many more options to set up the FA. In addition, I edited out some of the output which gives goodness of fit tests etc. The documentation for these functions (and package psych in general) is excellent. This example here is just intended to get you started.