Solved – SEM with binary dependent variable

binary-data, r, structural-equation-modeling

Much like with regression, handling binary dependent variables in SEM requires special considerations. Some of these are noted in Dave Garson's Structural Equation Modeling pages and include:

  1. Polychoric correlation. LISREL/PRELIS uses polyserial, tetrachoric, and polychoric correlations to create the input correlation matrix,
    combined with ADF estimation (see below), for variables which cannot
    be assumed to have a bivariate normal distribution.

    • Sample size issue. ADF [Asymptotically distribution-free] estimation in turn requires a very large sample size. Yuan and Bentler (1994)
      found satisfactory estimates only with a sample size of at least 2,000
      and preferably 5,000. Violating this requirement may introduce
      problems greater than treating ordinal data as interval and using ML
      estimation. This is also a reason cited for preferring the Bayesian
      estimation approach to ordinal data taken by Amos since Bayesian
      estimation can handle smaller samples than ML or ADF.

I'm currently trying to use the sem package in R to test my model, and the package's author has suggested on R-help that I use polychoric correlations. The problems are:

  1. I don't know what estimation method is being used with these correlations (i.e., ADF or ML).
  2. My sample size is small (N = 173).
  3. I'm not familiar with how to interpret polychoric associations (in case it is appropriate for me to use them at all). All the other variables in my model are continuous.

Any help and/or links would be greatly appreciated. I'm also considering other software such as OpenMx, but I'm still reading about how it handles binary data. Suggestions about other software I might want to use would also be appreciated.

Best Answer

Did you read the original Olsson (1979) paper? I believe it still provides the best description of what polychoric correlations are (although, I have to admit, I've probably skimmed only 10% of the existing literature; at some point it just gets too repetitive of the same limited set of ideas). Polychoric correlations are ML estimates of the correlation of the underlying bivariate normal distribution, so you interpret them just as you would Pearson product-moment correlations computed on continuous data. Given the ML origins of polychoric correlations, I never understood the advice to use ADF or other least squares methods with them to obtain model parameter estimates, although I do understand that, say, diagonally weighted least squares (I don't know whether John Fox implemented it in sem), while less asymptotically efficient, needs less auxiliary information for estimation.
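To make the interpretation concrete, here is a minimal sketch in R (the true correlation of 0.5, the sample size of 1,000, and the cut point at zero are arbitrary choices for illustration; it uses the MASS and polycor packages). It simulates bivariate normal data, dichotomizes both variables, and compares the attenuated Pearson correlation on the 0/1 scores with the polychoric (here tetrachoric) estimate, which targets the correlation of the underlying normals:

```r
library(MASS)     # mvrnorm() for bivariate normal draws
library(polycor)  # polychor() for polychoric/tetrachoric correlations

set.seed(42)
rho <- 0.5                                     # true latent correlation
xy  <- mvrnorm(1000, mu = c(0, 0),
               Sigma = matrix(c(1, rho, rho, 1), 2, 2))
x_bin <- factor(xy[, 1] > 0)                   # dichotomize at 0
y_bin <- factor(xy[, 2] > 0)

cor(as.numeric(x_bin), as.numeric(y_bin))      # Pearson on 0/1: attenuated
polychor(x_bin, y_bin, ML = TRUE)              # ML estimate, close to 0.5
```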

There is no magic sample size number where you hit 2,000 and -- BOOM! -- everything starts working. In my simulations (and I've burned a few petaflops this way and that for my papers), I've seen asymptotic results work perfectly fine with $N=200$ and fail to work with $N=5000$. In the most peculiar cases, for the same method and the same distribution of the underlying data, some asymptotic aspects, such as confidence interval coverage, would be fine at $N=300$, while others, such as the $\chi^2$ distribution of a test statistic, would not work until $N=1000$. So I am highly skeptical of any blanket sample size advice and would rather recommend running a simulation that addresses your particular sample size, model complexity, and magnitude of the errors.

The first paper to bash ADF (Hu, Bentler and Kano (1992)) used an insane degree of overidentification (something like 30 variables in the model, which translates to 400 degrees of freedom) and a sample size of 50. ADF would not even begin to work in those circumstances, as it cannot invert the matrix of fourth moments, which is rank-deficient at that sample size. And expecting a test statistic with 400 degrees of freedom to behave with a sample size below 1,000 is a high expectation, too.
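For instance, here is a rough sketch in R of the kind of small simulation I mean. Everything in it is hypothetical: one binary variable, one continuous variable, a true latent correlation of 0.3, and N = 173 to match the question; in practice you would simulate your full model and look at whatever quantities you actually care about.

```r
library(MASS)     # mvrnorm()
library(polycor)  # polyserial()

set.seed(1)
rho  <- 0.3       # hypothetical true latent correlation
N    <- 173       # the sample size from the question
reps <- 500
est <- se <- numeric(reps)

for (i in seq_len(reps)) {
  xy     <- mvrnorm(N, c(0, 0), matrix(c(1, rho, rho, 1), 2, 2))
  x_cont <- xy[, 1]                    # continuous observed variable
  y_bin  <- factor(xy[, 2] > 0)        # binary observed variable
  fit    <- polyserial(x_cont, y_bin, ML = TRUE, std.err = TRUE)
  est[i] <- fit$rho
  se[i]  <- sqrt(fit$var[1, 1])        # variance of rho (first element of the ML covariance matrix)
}

mean(est) - rho                        # bias of the estimate at N = 173
sd(est)                                # its Monte Carlo standard deviation
mean(abs(est - rho) <= 1.96 * se)      # rough 95% CI coverage
```

If the coverage is badly off or the estimates are visibly biased at N = 173, that tells you more than any rule of thumb will.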

So I understand the healthy skepticism you are demonstrating, but there is not much you can do about it in your situation. Just run polycor to get the correlation estimates, feed them to sem, and that will be it; there is little you can do to produce a much better analysis.
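As a concrete illustration of that workflow, here is a minimal sketch, assuming a recent version of the sem package in which specifyModel() accepts a text= argument (older versions read the specification from a file or the console). The data, the variable names (x1, x2, x3, ybin), and the one-factor model are all made up for illustration; you would substitute your own data frame and model:

```r
library(polycor)  # hetcor(): mixed Pearson / polyserial / polychoric matrix
library(sem)

## Fake data standing in for the real data set (N = 173, one binary outcome).
set.seed(7)
n <- 173
f <- rnorm(n)                                   # hypothetical latent factor
dat <- data.frame(
  x1   = 0.7 * f + rnorm(n),
  x2   = 0.6 * f + rnorm(n),
  x3   = 0.5 * f + rnorm(n),
  ybin = factor(0.8 * f + rnorm(n) > 0)         # binary variable as a factor
)

## Mixed correlation matrix: polyserial for factor/numeric pairs,
## Pearson for numeric/numeric pairs.
R <- hetcor(dat, ML = TRUE)$correlations

## A one-factor model in sem's RAM-style notation (purely illustrative).
model <- specifyModel(text = "
  F1 -> x1,      lam1, NA
  F1 -> x2,      lam2, NA
  F1 -> x3,      lam3, NA
  F1 -> ybin,    lam4, NA
  F1 <-> F1,     NA,   1
  x1 <-> x1,     e1,   NA
  x2 <-> x2,     e2,   NA
  x3 <-> x3,     e3,   NA
  ybin <-> ybin, e4,   NA
")

fit <- sem(model, S = R, N = n)                 # ML fit to the correlation matrix
summary(fit)
```

Keep in mind that the standard errors and the likelihood-ratio test from such a fit treat the input correlations as if they were sample covariances of normal variables, so take them with a grain of salt.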

If you were a Stata user, I would immediately recommend the gllamm package, but I am not sure whether a direct analogue of it exists in R.
