Solved – SEM with binary dependent variable

binary-data, r, structural-equation-modeling

Much like with regression, handling binary dependent variables in SEM requires special considerations. Some of these are noted in Dave Garson's Structural Equation Modeling pages and include:

  1. Polychoric correlation. LISREL/PRELIS uses polyserial, tetrachoric, and polychoric correlations to create the input correlation matrix,
    combined with ADF estimation (see below), for variables which cannot
    be assumed to have a bivariate normal distribution.

    • Sample size issue. ADF [Asymptotically distribution-free] estimation in turn requires a very large sample size. Yuan and Bentler (1994)
      found satisfactory estimates only with a sample size of at least 2,000
      and preferably 5,000. Violating this requirement may introduce
      problems greater than treating ordinal data as interval and using ML
      estimation. This is also a reason cited for preferring the Bayesian
      estimation approach to ordinal data taken by Amos since Bayesian
      estimation can handle smaller samples than ML or ADF.

I'm currently trying to use the sem package in R to test my model, and the package's author has suggested on R-help that I use polychoric correlations. The problems are:

  1. I don't know what estimation method is being used with these correlations (i.e., ADF or ML).
  2. My sample size is small (N = 173).
  3. I'm not familiar with how to interpret polychoric associations (in case it is appropriate for me to use them at all). All the other variables in my model are continuous.

Any help and/or links would be greatly appreciated. I'm also considering other software such as OpenMx, but I'm still reading about how it handles binary data. Suggestions about other software I might want to use would also be appreciated.

Best Answer

Did you read the original Olsson (1979) paper? I believe it still provides the best description of what polychoric correlations are (although, I have to admit, I've probably skimmed only 10% of the existing literature; at some point it just gets too repetitive of the same limited set of ideas). Polychoric correlations are ML estimates of the correlation of the underlying bivariate normal distribution, so you interpret them just as you would Pearson product-moment correlations computed on continuous data. Given the ML origins of polychoric correlations, I never understood the advice to use ADF or other least squares methods with them to obtain model parameter estimates, although I do understand that, say, diagonally weighted least squares (I don't know whether John Fox implemented it in sem), while less asymptotically efficient, needs less auxiliary information for estimation.
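To make the interpretation concrete, here is a minimal sketch in R (the true correlation of 0.5, the sample size of 1,000, and the cut point at zero are arbitrary choices for illustration; it uses the MASS and polycor packages). It simulates bivariate normal data, dichotomizes both variables, and compares the attenuated Pearson correlation on the 0/1 scores with the polychoric (here tetrachoric) estimate, which targets the correlation of the underlying normals:

```r
library(MASS)     # mvrnorm() for bivariate normal draws
library(polycor)  # polychor() for polychoric/tetrachoric correlations

set.seed(42)
rho <- 0.5                                     # true latent correlation
xy  <- mvrnorm(1000, mu = c(0, 0),
               Sigma = matrix(c(1, rho, rho, 1), 2, 2))
x_bin <- factor(xy[, 1] > 0)                   # dichotomize at 0
y_bin <- factor(xy[, 2] > 0)

cor(as.numeric(x_bin), as.numeric(y_bin))      # Pearson on 0/1: attenuated
polychor(x_bin, y_bin, ML = TRUE)              # ML estimate, close to 0.5
```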

There is no magic sample size number where you hit 2,000 and -- BOOM! -- everything starts working. In my simulations (and I've burned a few petaflops this way and that for my papers), I've seen asymptotic results work perfectly fine with $N=200$ and fail to work with $N=5000$. In the most peculiar cases, for the same method and the same distribution of the underlying data, some asymptotic aspects, such as confidence interval coverage, would be fine at $N=300$, while others, such as the $\chi^2$ distribution of a test statistic, would not work until $N=1000$. So I am highly skeptical of any blanket sample size advice and would rather recommend running a simulation that addresses your particular sample size, model complexity, and magnitude of the errors.

The first paper to bash ADF (Hu, Bentler and Kano (1992)) used an insane degree of overidentification (something like 30 variables in the model, which translates to 400 degrees of freedom) and a sample size of 50. ADF would not even begin to work in those circumstances, as it cannot invert the matrix of fourth moments, which is rank-deficient at that sample size. And expecting a test statistic with 400 degrees of freedom to behave with a sample size below 1,000 is a high expectation, too.
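For instance, here is a rough sketch in R of the kind of small simulation I mean. Everything in it is hypothetical: one binary variable, one continuous variable, a true latent correlation of 0.3, and N = 173 to match the question; in practice you would simulate your full model and look at whatever quantities you actually care about.

```r
library(MASS)     # mvrnorm()
library(polycor)  # polyserial()

set.seed(1)
rho  <- 0.3       # hypothetical true latent correlation
N    <- 173       # the sample size from the question
reps <- 500
est <- se <- numeric(reps)

for (i in seq_len(reps)) {
  xy     <- mvrnorm(N, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
  x_cont <- xy[, 1]                    # continuous observed variable
  y_bin  <- factor(xy[, 2] > 0)        # binary observed variable
  fit    <- polyserial(x_cont, y_bin, ML = TRUE, std.err = TRUE)
  est[i] <- fit$rho
  se[i]  <- sqrt(fit$var[1, 1])        # variance of rho (first element of the ML covariance matrix)
}

mean(est) - rho                        # bias of the estimate at N = 173
sd(est)                                # its Monte Carlo standard deviation
mean(abs(est - rho) <= 1.96 * se)      # rough 95% CI coverage
```

If the coverage is badly off or the estimates are visibly biased at N = 173, that tells you more than any rule of thumb will.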

So I understand the healthy skepticism you are demonstrating, but there is not much you can do about it in your situation. Just run polycor to get the correlation estimates, feed them to sem, and that will be it; there is little you can do to produce a much better analysis.
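As a concrete illustration of that workflow, here is a minimal sketch, assuming a recent version of the sem package in which specifyModel() accepts a text= argument (older versions read the specification from a file or the console). The data, the variable names (x1, x2, x3, ybin), and the one-factor model are all made up for illustration; you would substitute your own data frame and model:

```r
library(polycor)  # hetcor(): mixed Pearson / polyserial / polychoric matrix
library(sem)

## Fake data standing in for the real data set (N = 173, one binary outcome).
set.seed(7)
n <- 173
f <- rnorm(n)                                   # hypothetical latent factor
dat <- data.frame(
  x1   = 0.7 * f + rnorm(n),
  x2   = 0.6 * f + rnorm(n),
  x3   = 0.5 * f + rnorm(n),
  ybin = factor(0.8 * f + rnorm(n) > 0)         # binary variable as a factor
)

## Mixed correlation matrix: polyserial for factor/numeric pairs,
## Pearson for numeric/numeric pairs.
R <- hetcor(dat, ML = TRUE)$correlations

## A one-factor model in sem's RAM-style notation (purely illustrative).
model <- specifyModel(text = "
  F1 -> x1,      lam1, NA
  F1 -> x2,      lam2, NA
  F1 -> x3,      lam3, NA
  F1 -> ybin,    lam4, NA
  F1 <-> F1,     NA,   1
  x1 <-> x1,     e1,   NA
  x2 <-> x2,     e2,   NA
  x3 <-> x3,     e3,   NA
  ybin <-> ybin, e4,   NA
")

fit <- sem(model, S = R, N = n)                 # ML fit to the correlation matrix
summary(fit)
```

Keep in mind that the standard errors and the likelihood-ratio test from such a fit treat the input correlations as if they were sample covariances of normal variables, so take them with a grain of salt.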

If you were a Stata user, I would immediately recommend the gllamm package, but I am not sure whether a direct analogue of it exists in R.
