R – Help with SEM Modeling Using OpenMx and Polycor

modeling, multiple-regression, r, structural-equation-modeling

I am having a lot of problems with one data set to which I am trying to apply SEM.

We suppose the existence of 5 latent factors A, B, C, D, E, with indicators A1 to A5 (ordered factors), B1 to B3 (quantitative), and C1, D1, E1 respectively (the last three are ordered factors, with only 2 levels for E1). We are interested in the covariances between all the factors.

I tried to use OpenMx to do so. Here are a few of my attempts:

  • I first tried to use threshold matrices for all the ordered factors, but the optimization failed to converge.

  • I then decided to use polychoric/polyserial correlations instead of the raw data, with the function hetcor from the polycor package (I planned to bootstrap the sample to get confidence intervals). It also fails to converge!

  • I tried restricting to individuals with complete data; that fails too!

My first question is: is there a natural way to interpret these failures?

My second question is: what should I do???

Edit: for future readers who may encounter the same problem, after going through the code of the polycor functions… the solution is simply to call hetcor() with the option std.err=FALSE. This gives estimates very similar to those StasK gave. I don't have time right now to understand better what's going on here! The questions below have been pretty well answered by StasK.
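
For illustration, here is a minimal sketch of that workaround combined with the bootstrap idea I mentioned above (the number of replicates, the seed and the percentile interval are arbitrary choices, and resamples where hetcor still fails are simply dropped):

require("polycor")
load("data_sem.RData")

# the workaround: skip the standard-error computation that triggers the error
het <- hetcor(L1, std.err = FALSE)

# bootstrap sketch: resample rows and recompute the heterogeneous correlations
set.seed(1)
boot_once <- function() {
    idx <- sample(nrow(L1), replace = TRUE)
    r <- try(hetcor(L1[idx, ], std.err = FALSE)$correlations, silent = TRUE)
    if (inherits(r, "try-error")) NULL else r  # drop resamples where hetcor fails
}
boot_cors <- Filter(Negate(is.null), replicate(200, boot_once(), simplify = FALSE))

# e.g. a 95% percentile interval for the A1-A2 correlation
quantile(sapply(boot_cors, function(m) m["A1", "A2"]), c(0.025, 0.975))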

I have other questions, but before anything else, here is a URL to an RData file containing a data frame L1 with only the complete data: data_sem.RData

Here are a few lines of code showing the failure of hetcor.

> require("OpenMx")
> require("polycor")
> load("data_sem.RData")
> hetcor(L1)
Error in cut.default(scale(x), c(-Inf, row.cuts, Inf)) : 
  'breaks' are not unique
In addition: There were 11 warnings (use warnings() to see them)
> head(L1)
   A1 A2 A3 A4 A5       B1       B2       B3 C1 D1 E1
1   4  5  4  5  7 -0.82759  0.01884 -3.34641  4  6  1
4   7  5  0  4  6 -0.18103  0.14364  0.35730  0  1  0
7   7  5  7  6  9 -0.61207 -0.18914  0.13943  0  0  0
10  5  5 10  7  3 -1.47414  0.10204  0.13943  2  0  0
11  7  5  8  9  9 -0.61207  0.06044 -0.73203  0  2  0
12  5  5  9 10  5  0.25000 -0.52192  1.44662  0  0  0

But I can still compute a correlation or covariance matrix in a very dirty way, by treating my ordered factors as quantitative variables:

> Cor0 <- cor(data.frame(lapply(L1, as.numeric)))
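
As an aside, given the edit above: once hetcor() converges with std.err = FALSE, its heterogeneous correlation matrix could be substituted for this crude numeric version, along the lines of:

# untested alternative to the numeric-conversion shortcut above
Cor0 <- hetcor(L1, std.err = FALSE)$correlations

The rest of the question keeps the crude version of Cor0.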

Here is a piece of OpenMx code together with my next question: is the following model correct? Not too many free parameters?

manif <- c("A1","A2","A3","A4","A5", "B1","B2","B3", "C1", "D1", "E1");

model1 <- mxModel(type="RAM",
        manifestVars=manif, latentVars=c("A","B","C","D","E"),
        # factor variance
        mxPath(from=c("A","B","C","D","E"), arrows=2, free=FALSE, values = 1),
        # factor covariance
        mxPath(from="A", to="B",  arrows=2, values=0.5),
        mxPath(from="A", to="C",  arrows=2, values=0.5),
        mxPath(from="A", to="D",  arrows=2, values=0.5),
        mxPath(from="A", to="E",  arrows=2, values=0.5),
        mxPath(from="B", to="C",  arrows=2, values=0.5),
        mxPath(from="B", to="D",  arrows=2, values=0.5),
        mxPath(from="B", to="E",  arrows=2, values=0.5),
        mxPath(from="C", to="D",  arrows=2, values=0.5),
        mxPath(from="C", to="E",  arrows=2, values=0.5),
        mxPath(from="D", to="E",  arrows=2, values=0.5),
        # factors → manifest vars
        mxPath(from="A", to=c("A1","A2","A3","A4","A5"), free=TRUE, values=1),
        mxPath(from="B", to=c("B1","B2","B3"), free=TRUE, values=1),
        mxPath(from="C", to=c("C1"), free=TRUE, values=1),
        mxPath(from="D", to=c("D1"), free=TRUE, values=1),
        mxPath(from="E", to=c("E1"), free=TRUE, values=1),
        # error terms
        mxPath(from=manif, arrows=2, values=1, free=TRUE),
        # data
        mxData(Cor0, type="cor",numObs=dim(L1)[1])
       );

And one last question. With this model (let’s forget for a moment the inappropriate way the correlation matrix is computed), I run OpenMx:

> mxRun(model1) -> fit1
Running untitled1 
> summary(fit1)

Among the summary output is this:

observed statistics:  55 
estimated parameters:  32 
degrees of freedom:  23 
-2 log likelihood:  543.5287 
saturated -2 log likelihood:  476.945 
number of observations:  62 
chi-square:  66.58374 
p:  4.048787e-06 

The fit seems very bad, despite the huge number of parameters. What does that mean? Does that mean that we should add covariances between manifest variables?
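
For reference, here is my reading of where these numbers come from (a sketch of the counting, not part of the OpenMx output):

p <- length(manif)       # 11 manifest variables
p * (p + 1) / 2 - p      # 55 observed statistics: unique correlations, diagonal fixed at 1
11 + 10 + 11             # 32 free parameters: loadings + factor covariances + residual variances
55 - 32                  # 23 degrees of freedom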

Many thanks in advance for all your answers; I am slowly going insane…

Best Answer

You must have uncovered a bug in polycor, which you would want to report to John Fox. Everything runs fine in Stata using my polychoric package:

    . polychoric *

    Polychoric correlation matrix

               A1          A2          A3          A4          A5          B1          B2          B3          C1          D1          E1
   A1           1
   A2   .34544812           1
   A3   .39920225   .19641726           1
   A4   .09468652   .04343741   .31995685           1
   A5   .30728339   -.0600463   .24367634   .18099061           1
   B1   .01998441  -.29765985   .13740987   .21810968   .14069473           1
   B2  -.19808738   .17745687  -.29049459  -.21054867   .02824307  -.57600551          1
   B3   .17807109  -.18042045   .44605383   .40447746   .18369998   .49883132  -.50906364           1
   C1  -.35973454  -.33099295  -.19920454  -.14631621  -.36058235   .00066762  -.05129489  -.11907687           1
   D1   -.3934594  -.21234022  -.39764587  -.30230591  -.04982743  -.09899428   .14494953   -.5400759   .05427906           1
   E1  -.13284936   .17703745  -.30631236  -.23069382  -.49212315  -.26670382   .24678619  -.47247566    .2956692   .28645516           1

For the latent variables that are measured with a single indicator (C, D, E), you need to fix the variance of the indicator in the continuous version of it, as otherwise the scale of the latent variable is not identified. Given that with binary/ordinal responses it is fixed to 1 anyway under (ordinal) probit-type links, this probably means that you would have to postulate either that your latent variable is equivalent to the observed indicator, or a value for the standardized loading. This essentially makes your model equivalent to a CFA model in which latent factors A and B are measured with {A1-A5, C1, D1, E1} and {B1-B3, C1, D1, E1}, respectively.
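
For concreteness, a minimal sketch (my illustration of this idea, not code from the answer) of expressing "the latent is equivalent to its observed indicator" in the question's OpenMx model: fix each single-indicator loading to 1 and its residual variance to 0.

# assumes model1 from the question; re-specifying a path in mxModel() is
# expected to replace the earlier free version of that path
model2 <- mxModel(model1,
        # single-indicator latents: loadings fixed to 1 ...
        mxPath(from="C", to="C1", free=FALSE, values=1),
        mxPath(from="D", to="D1", free=FALSE, values=1),
        mxPath(from="E", to="E1", free=FALSE, values=1),
        # ... and residual variances fixed to 0, so C, D, E equal C1, D1, E1
        mxPath(from=c("C1","D1","E1"), arrows=2, free=FALSE, values=0))
fit2 <- mxRun(model2)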
