Solved – Is there Factor analysis or PCA for ordinal or binary data


I have run principal component analysis (PCA), exploratory factor analysis (EFA), and confirmatory factor analysis (CFA), treating Likert-scale data (5-level responses: none, a little, some, ...) as continuous variables. Then, using lavaan, I repeated the CFA with the variables defined as categorical.

I would like to know which analyses are appropriate for, and equivalent to, PCA and EFA when the data are ordinal in nature, and which when the data are binary.

I would also appreciate suggestions for specific packages or software in which such analyses can be easily implemented.

Best Answer

Traditional (linear) PCA and factor analysis require scale-level (interval or ratio) data. Likert-type rating data are often assumed to be scale-level because such data are easier to analyze. The decision is sometimes warranted statistically, especially when the number of ordered categories is greater than 5 or 6. (Purely logically, though, the data type and the number of scale levels are distinct questions.)

What if you prefer to treat a polytomous Likert scale as ordinal, though? Or what if you have dichotomous data? Is it possible to do exploratory factor analysis or PCA on them?

There are currently three main approaches to performing FA (including PCA as its special case) on categorical ordinal or binary variables (read also this account about the binary data case, and this consideration about what might be done with an ordinal scale).

  1. Optimal scaling approach (a family of applications). Also called Categorical PCA (CatPCA) or nonlinear FA. In CatPCA, ordinal variables are monotonically transformed ("quantified") into their "underlying" interval versions, with the objective of maximizing the variance explained by the selected number of principal components extracted from those interval data. This makes the method openly goal-driven (rather than theory-driven), and it makes it important to decide on the number of principal components in advance. If true FA is needed instead of PCA, the usual linear FA can then naturally be performed on the transformed variables output by CatPCA. With binary variables, CatPCA (regrettably?) behaves in the manner of usual PCA, that is, as if they were continuous variables. CatPCA also accepts nominal variables and any mixture of variable types (nice).

  2. Inferred underlying variable approach. Also known as PCA/FA performed on tetrachoric (for binary data) or polychoric (for ordinal data) correlations. A normal distribution is assumed for the underlying (then binned) continuous variable behind every manifest variable. Classic FA is then applied to analyze those correlations. The approach easily allows for a mixture of interval, ordinal, and binary data. One disadvantage is that, when inferring the correlations, it has no clue about the multivariate distribution of the underlying variables: it can "conceive of" bivariate distributions at most, and so it is not based on full information.

  3. Item response theory (IRT) approach. Sometimes also called logistic FA or latent trait analysis. A model very close to the binary logit model (for binary data) or the proportional log odds model (for ordinal data) is applied. The algorithm is not tied to decomposing a correlation matrix, so it is a bit removed from traditional FA; still, it is a bona fide categorical FA. "Discrimination parameters" closely correspond to the loadings of FA, but "difficulties" replace the notion of the "uniquenesses" of FA. The certainty of IRT fitting decreases quickly as the number of factors grows, which is a problematic side of this approach. IRT is extensible in its own way to incorporate mixed interval+binary+ordinal and possibly nominal variables.
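To make approach (2) concrete for the binary case, here is a minimal sketch of a tetrachoric correlation computed from a 2x2 table. It is an illustration only, not the routine any particular package uses: the function names (`bvn_density`, `bvn_cdf`, `tetrachoric`) are made up for this example, the thresholds are taken from the observed margins, and the correlation is found by bisection so that the bivariate normal probability of the (0, 0) cell matches its observed proportion. It assumes that all four cells are non-empty.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    one_minus_r2 = 1.0 - r * r
    expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus_r2))


def bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k) for correlation rho.

    Uses the identity Phi2(h, k; rho) = Phi(h) * Phi(k) plus the integral
    of the bivariate density over the correlation from 0 to rho, evaluated
    with Simpson's rule (steps must be even).
    """
    base = STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps  # signed step, so negative rho works too
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return base + total * width / 3.0


def tetrachoric(n00, n01, n10, n11, tol=1e-8):
    """Tetrachoric correlation from 2x2 counts n_xy (x, y in {0, 1}).

    Assumes non-empty cells; real software adds corrections for zero cells.
    """
    n = n00 + n01 + n10 + n11
    h = STD_NORMAL.inv_cdf((n00 + n01) / n)  # threshold for X, from P(X = 0)
    k = STD_NORMAL.inv_cdf((n00 + n10) / n)  # threshold for Y, from P(Y = 0)
    target = n00 / n                         # observed P(X = 0, Y = 0)
    lo, hi = -0.999, 0.999
    while hi - lo > tol:                     # bvn_cdf is increasing in rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For the table `tetrachoric(40, 10, 10, 40)` this gives about 0.81, whereas the Pearson (phi) correlation of the same table is 0.60, which shows how far the inferred underlying correlation can sit from the correlation of the observed binary variables. A matrix of such coefficients would then be passed to an ordinary FA routine.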

Factor scores in approaches (2) and (3) are more difficult to estimate than factor scores in classic FA or in approach (1). However, several methods do exist (expected a posteriori or maximum a posteriori methods, the maximum likelihood method, etc.).
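As an illustration of the expected a posteriori (EAP) idea, here is a minimal sketch for a single-factor two-parameter logistic IRT model. Everything specific in it is an assumption made for the example: the function name `eap_score`, the standard normal prior on the latent trait, the simple equally spaced grid on [-4, 4], and item parameters that are treated as already known (in practice they are estimated first).

```python
import math


def eap_score(responses, discriminations, difficulties, grid_points=81):
    """EAP estimate of the latent trait for one respondent.

    responses: list of 0/1 answers.
    discriminations, difficulties: hypothetical, already-estimated 2PL item
    parameters a_j and b_j, with P(x_j = 1 | theta) = logistic(a_j * (theta - b_j)).
    """
    lo, hi = -4.0, 4.0
    step = (hi - lo) / (grid_points - 1)
    numerator = denominator = 0.0
    for g in range(grid_points):
        theta = lo + g * step
        # Unnormalized N(0, 1) prior; the constant cancels in the ratio.
        weight = math.exp(-0.5 * theta * theta)
        # Multiply in the likelihood of the observed response pattern.
        for x, a, b in zip(responses, discriminations, difficulties):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            weight *= p if x == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator  # posterior mean of theta
```

For example, `eap_score([1, 1, 0], [1.2, 0.8, 1.5], [-0.5, 0.0, 1.0])` returns the posterior mean of the trait for a respondent who endorsed the first two items. The maximum a posteriori variant would take the grid point with the largest weight instead of the weighted mean.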

Factor analysis model assumptions are chiefly the same in the three approaches as in traditional FA. Approach (1) is available in R, SPSS, and SAS (to my knowledge). Approaches (2) and (3) are implemented mostly in specialized latent-variable packages: Mplus, LISREL, EQS.

  4. Polynomial approach. It has not been developed in full yet. Principal components can be modeled as polynomial combinations of variables (using polynomials is a popular way to model nonlinear effects of ordinal regressors). Also, observed categories in turn can be modeled as discrete manifestations of polynomial combinations of latent factors.

  5. There exists a flourishing field of nonlinear dimensionality reduction techniques; some of them can be applied or adapted to work with categorical data (especially binary data, or data binarized into a high-dimensional sparse dataset).

  6. Performing classic (linear) FA/PCA on rank correlations or other associations suited for categorical data (Spearman, Kendall, Somers' d, etc.). In the case of ordinal data, this is a purely heuristic approach, lacking theoretical grounds and not recommended at all. With binary data, the Spearman rho and Kendall tau-b correlations and the Phi association all equal the Pearson $r$ correlation, so using them is nothing but doing the usual linear FA/PCA on binary data (some perils of it here). It is also possible (though debatable) to do the analysis on $r$ rescaled with respect to its magnitude bound for the given data.
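The claim that Spearman rho, Kendall tau-b, and Phi all coincide with Pearson $r$ on binary data is easy to check numerically. The sketch below uses only the Python standard library; the helper names and the small 0/1 sample are made up for the illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2.0
    return [rank_of[v] for v in values]


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b computed by looping over all pairs of observations."""
    concordant = discordant = untied_x = untied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        untied_x += dx != 0
        untied_y += dy != 0
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)


def phi(x, y):
    """Phi coefficient from the 2x2 table of two 0/1 variables."""
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    n10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    n01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    n00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


# Illustrative binary data (made up for this example).
x = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
coefficients = [pearson(x, y), spearman(x, y), kendall_tau_b(x, y), phi(x, y)]
```

All four entries of `coefficients` agree up to floating-point rounding, so swapping one of these coefficients for another changes nothing in an FA/PCA of binary items.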
