Solved – What are the dangers of calculating Pearson correlations (instead of tetrachoric ones) for binary variables in factor analysis?

binary data · categorical data · factor analysis · r

I do research on educational games, and some of my current projects involve using data from BoardGameGeek (BGG) and VideoGameGeek (VGG) to examine relationships between design elements of games (e.g., "set in World War II" or "involves rolling dice") and player ratings of those games (i.e., scores out of 10). Each of these design elements corresponds to a tag in the BGG or VGG system, so each element is essentially a dichotomous variable: a game has a 1 for every tag that's present in the database for it and a 0 for every tag that isn't.

There are dozens of these tags, so I want to use exploratory factor analysis (EFA) to come up with a manageable number of "genres" that capture patterns in game design. Consulting several sources, I understand that since I'm working with dichotomous variables, I ought to use polychoric correlations (specifically tetrachoric ones, since my variables are binary) instead of Pearson correlations when deriving my factors (there are other options out there, like latent trait analysis, but this is the one I'm exploring for now).

Out of curiosity, I came up with two sets of factors, one using Pearson correlations and the other using polychoric correlations (same number of factors each time). My problem is that the factors computed using Pearson correlations make a lot more sense and are easier to interpret than the factors computed using polychoric correlations. In other words, the "genres" from the first set of factors make intuitive sense and correspond with my understanding of how games are typically designed; that is not the case for the second set of factors.
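
For what it's worth, here is a minimal sketch of that comparison in R (assuming tags is my 0/1 data frame of tag variables and that five factors are extracted; psych's fa() accepts cor = "tet" to factor the tetrachoric matrix instead of the Pearson one):

    library(psych)

    # Same data, same number of factors; only the correlation type differs.
    fa_pearson <- fa(tags, nfactors = 5, fm = "minres", cor = "cor")  # Pearson (phi)
    fa_tetra   <- fa(tags, nfactors = 5, fm = "minres", cor = "tet")  # tetrachoric

    print(fa_pearson$loadings, cutoff = 0.3)
    print(fa_tetra$loadings,   cutoff = 0.3)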

On the one hand, I want to make sure that I meet the assumptions of the tests I'm using, even if that makes my results less pretty. On the other, I feel that part of the goal of factor analysis and (more broadly) model-building is to come up with something useful, and the more useful information emerges when I'm "breaking the rules." Is the need for a useful model enough to outweigh violating the assumptions of this test? What exactly are the consequences of using Pearson correlations instead of polychoric ones?

Best Answer

Linear factor analysis is, theoretically and logically, for continuous variables only. If your variables are not continuous but are, for example, dichotomous, one way forward is to posit underlying continuous variables behind them and declare that the observed variables are binned versions of those underlying, "true" ones. You cannot quantify a dichotomous variable into a continuous one without an extraneous "tutor," but you can still infer the correlations that would have held if your variables had not yet been binned and were still the "original" continuous, normally distributed ones. These are the tetrachoric correlations (or polychoric correlations, if in place of binary variables you have ordinal ones). So, using tetrachoric correlations (inferred Pearson correlations) in place of phi correlations (observed Pearson correlations for dichotomous data) is a logical act.
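
As a hedged illustration in R (with a hypothetical 0/1 data matrix x), the phi coefficients are simply the Pearson correlations of the observed binaries, while the psych package's tetrachoric() infers the correlations of the assumed underlying normal variables:

    library(psych)

    phi <- cor(x)          # observed Pearson (phi) correlations of the binaries
    tet <- tetrachoric(x)  # inferred correlations of the latent normal variables
    tet$rho                # tetrachoric correlation matrix
    tet$tau                # estimated thresholds (cut points), one per variable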

Phi correlations computed on dichotomously binned variables are very sensitive to the cut point (aka the "difficulty level of the task") at which the binning took place. A pair of variables can hope to attain the theoretical bound $r=1$ only when they are binned at equivalent cut points: the more the cut points differ between them, the lower the maximal bound of the possible $r$ between them. (This is the general effect of the sameness of marginal distributions on the possible range of Pearson $r$, but with dichotomous variables the effect is at its sharpest, because there are so few values to take on.) So the phi correlations in a matrix can be seen as unequally deflated by the contrasting marginal distributions of the dichotomous variables; you don't know whether one correlation is larger than another "truly" or because of the different cut points in those two pairs of variables. The number of factors to extract (following criteria such as Kaiser's "eigenvalue > 1") will consequently be inflated: some extracted "factors" are the outcome of the unevenness and diversity of the cut points, not substantive latent factors. This is the practical reason not to use phi correlations, at least in their raw, non-rescaled form.
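
A small simulation makes the deflation concrete (a sketch, assuming the psych package): two binaries are cut from the same underlying variable, so their latent correlation is exactly 1, yet phi is capped well below 1 because the cut points differ, while the tetrachoric estimate stays near 1.

    library(psych)

    set.seed(1)
    z  <- rnorm(1e5)           # one underlying continuous variable
    b1 <- as.numeric(z > 0)    # binned at the median
    b2 <- as.numeric(z > 1.5)  # binned far in the tail (different cut point)

    cor(b1, b2)                     # phi: about 0.27, despite latent r = 1
    tetrachoric(cbind(b1, b2))$rho  # tetrachoric: close to 1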

There is evidence from simulation/binning studies that factor analysis based on tetrachoric correlations deteriorates if there are many strong (> 0.7) correlations in the matrix. Tetrachoric correlation is not ideal, either: if the cut points of the correlated underlying variables lie at opposite extremes (so that the marginal distributions of the dichotomous variables are oppositely skewed) while the underlying association is strong, the tetrachoric coefficient overestimates it further. Note also that a tetrachoric correlation matrix isn't necessarily positive semidefinite in smaller samples and might thus need correction ("smoothing"). Still, many regard it as a better basis for factor analysis than plain Pearson (phi) coefficients.
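
The smoothing correction is available off the shelf; a sketch, again assuming a 0/1 data matrix x: negative eigenvalues flag a non-positive-semidefinite matrix, and psych's cor.smooth() nudges it toward a nearby positive definite one before factoring.

    library(psych)

    rho <- tetrachoric(x)$rho
    eigen(rho, only.values = TRUE)$values  # any negative value => not PSD

    rho_ok <- cor.smooth(rho)              # smoothed matrix, safe to factor
    fa(rho_ok, nfactors = 5, n.obs = nrow(x))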

But why do factor analysis on binary data at all? There are other options, including latent trait / IRT analysis (a form of "logistic" factor analysis) and multiple correspondence analysis (if you see your binary variables as nominal categories).
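
For completeness, hedged pointers to those two routes in R (assuming tags is a 0/1 data frame; the ltm and FactoMineR packages are common choices, not the only ones):

    library(ltm)         # latent trait / IRT
    library(FactoMineR)  # multiple correspondence analysis

    irt_fit <- ltm(tags ~ z1)  # 2-parameter logistic model with one latent trait
    summary(irt_fit)

    # MCA expects factor columns, so convert the 0/1 variables first.
    mca_fit <- MCA(data.frame(lapply(tags, factor)), graph = FALSE)
    mca_fit$eig                # variance accounted for by the MCA dimensions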

See also:

  • Assumptions of linear factor analysis.
  • Rescaled Pearson $r$ as a possible (though not very convincing) alternative to tetrachoric $r$ for FA.