Solved – When would one pre-specify thresholds in SEM/CFA for limited dependent variables

categorical-data, confirmatory-factor, discrete-data, lavaan, structural-equation-modeling

I'm working on a structural equation model with limited dependent (discrete) variables using lavaan (0.5-17) in R (3.1.2).

I have found that if I define the thresholds for the categorical variables, the goodness of fit declines drastically, yet the coefficients and standard errors remain almost exactly the same as in the same model without user-defined thresholds. I also find that when the thresholds are not user-defined, they often take on values well outside the range of both the categorical variable and the underlying continuous variable.

My question is: should I define the thresholds, or should I let the software estimate them? More broadly, under what circumstances would one pre-specify thresholds in SEM/CFA?

I also hope that somebody can explain the poor goodness of fit, despite nearly identical other parameters, and what it means that the estimated thresholds often fall outside the range of the underlying variable. I suspect these questions point to limitations in my understanding of SEM.

Example:

library(lavaan)

# make a categorical variable from the PoliticalDemocracy example. Place thresholds at 4.5, 5 and 5.5
newData <- PoliticalDemocracy
newData$x1[PoliticalDemocracy$x1 < 4.5] <- 1
newData$x1[PoliticalDemocracy$x1 >= 4.5 & PoliticalDemocracy$x1 < 5] <- 2
newData$x1[PoliticalDemocracy$x1 >= 5 & PoliticalDemocracy$x1 < 5.5] <- 3
newData$x1[PoliticalDemocracy$x1 >=5.5] <- 4
newData$x1 <- factor(newData$x1, ordered=TRUE)

# specify a model allowing thresholds to be free parameters
freeModel <- '
   # latent variable definitions
     ind60 =~ x1 + x2 + x3
     dem60 =~ y1 + y2 + y3 + y4
     dem65 =~ y5 + y6 + y7 + y8
   # regressions
     dem60 ~ ind60
     dem65 ~ ind60 + dem60
   # residual covariances
     y1 ~~ y5
     y2 ~~ y4 + y6
     y3 ~~ y7
     y4 ~~ y8
     y6 ~~ y8'

# specify a model with user-defined thresholds (by adding thresholds to the original model)
specifiedModel <- paste(freeModel, '
                x1 | 4.51*t1 + 5.01*t2 + 5.51*t3')

# fit both models; also fit the original continuous model for comparison
originalFit     <- sem(freeModel, data=PoliticalDemocracy)
ldvFreeFit      <- sem(freeModel, data=newData)
ldvSpecifiedFit <- sem(specifiedModel, data=newData)

# display estimates
summary(ldvFreeFit)
summary(ldvSpecifiedFit)

# display goodness of fit
print(as.data.frame(fitMeasures(ldvFreeFit))[rownames(as.data.frame(fitMeasures(ldvFreeFit)))=='rmsea.scaled',])
print(as.data.frame(fitMeasures(ldvSpecifiedFit))[rownames(as.data.frame(fitMeasures(ldvSpecifiedFit)))=='rmsea.scaled',])

Output (truncated). Note that the freely estimated thresholds are -0.544, -0.117 and 0.623, nowhere near the 4.51, 5.01 and 5.51 I specified:

> summary(ldvFreeFit)
                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  ind60 =~
    x1                1.000
    x2                1.483    0.194    7.639    0.000
    x3                1.186    0.208    5.694    0.000
  dem60 =~
    y1                1.000
    y2                1.177    0.381    3.090    0.002
    ...                 ...      ...      ...      ...

Regressions:
  dem60 ~
    ind60             1.026    0.303    3.386    0.001
  dem65 ~
    ind60             0.405    0.150    2.699    0.007
    dem60             0.896    0.133    6.758    0.000
     ...                ...      ...      ...      ...
Thresholds:
    x1|t1            -0.544    0.154   -3.535    0.000
    x1|t2            -0.117    0.146   -0.803    0.422
    x1|t3             0.623    0.156    3.982    0.000

> summary(ldvSpecifiedFit)
                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  ind60 =~
    x1                1.000
    x2                1.483    0.194    7.639    0.000
    x3                1.186    0.208    5.694    0.000
  dem60 =~
    y1                1.000
    y2                1.177    0.381    3.090    0.002
    ...                 ...      ...      ...      ...

Regressions:
  dem60 ~
    ind60             1.026    0.303    3.386    0.001
  dem65 ~
    ind60             0.405    0.150    2.699    0.007
    dem60             0.896    0.133    6.758    0.000
     ...                ...      ...      ...      ...
Thresholds:
    x1|t1             4.510
    x1|t2             5.010
    x1|t3             5.510

# display goodness of fit
> print(as.data.frame(fitMeasures(ldvFreeFit))[rownames(as.data.frame(fitMeasures(ldvFreeFit)))=='rmsea.scaled',])
[1] 0.04448965
> print(as.data.frame(fitMeasures(ldvSpecifiedFit))[rownames(as.data.frame(fitMeasures(ldvSpecifiedFit)))=='rmsea.scaled',])
[1] 1.686976

Best Answer

The thresholds are not on the scale of the original variable; with lavaan's default estimator for ordered indicators they are on the scale of the underlying latent response variable, which is treated as standard normal (a probit link, as in ordinal probit regression). They play the same role as intercepts in a regular regression model: each threshold is the cut point on that standardized underlying variable at which the observed response moves to the next category, given that the predictors (including the latent variables) are equal to zero.
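As a quick sanity check (using the categorized newData from the question, and assuming the default probit parameterization), the freely estimated thresholds should sit close to the standard-normal quantiles of the cumulative category proportions of x1:

# Standard-normal quantiles of the cumulative category proportions of the
# recoded x1; these should be roughly the thresholds reported for ldvFreeFit
# (about -0.54, -0.12 and 0.62).
cumProps <- cumsum(prop.table(table(newData$x1)))[1:3]
qnorm(cumProps)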

If, in a model with continuous variables, you constrained the intercept of a variable to some incorrect value, the model fit would suffer badly, but the rest of the estimates should barely change — which is exactly the pattern you observe.
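To see the analogy for yourself, here is a rough sketch using the continuous PoliticalDemocracy data from the question (the value 6 is just an arbitrary wrong intercept for x1, chosen for illustration); compare the fit measures and estimates with those of originalFit:

# fix x1's intercept to an arbitrary wrong value and refit with a mean structure
badInterceptModel <- paste(freeModel, '
                x1 ~ 6*1')
badInterceptFit <- sem(badInterceptModel, data=PoliticalDemocracy, meanstructure=TRUE)
fitMeasures(badInterceptFit, 'rmsea')
summary(badInterceptFit)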

Most of the time you can ignore the thresholds. You sometimes need to constrain one of them for identification purposes. For example, to identify the mean of a latent variable, you can either constrain the mean to zero or constrain a threshold of an indicator (just as you constrain one of the loadings to identify the variance). The default in lavaan is to constrain the mean/intercept of the latent variable to zero.
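A minimal sketch of that alternative identification, reusing freeModel and newData from the question (an illustration of the idea rather than a recommended parameterization for this particular model):

# fix the first threshold of x1 and free the latent intercept of ind60,
# instead of relying on the default of fixing the latent mean to zero
altIdModel <- paste(freeModel, '
                x1 | 0*t1
                ind60 ~ NA*1')
altIdFit <- sem(altIdModel, data=newData)
summary(altIdFit)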

You constrain thresholds when you want to make the means/intercepts of latent variables comparable. For example, in a multiple-group model where you want to compare latent means, the default of fixing the means to zero in every group makes the comparison meaningless; instead, you fix the mean to zero in one (reference) group, constrain the thresholds to be equal across groups, and freely estimate the means in the remaining groups.
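A sketch of that setup (hypothetical: the grouping variable "region" is invented here and is not part of the PoliticalDemocracy data):

# "region" is a hypothetical grouping variable, used only for illustration.
# With loadings and thresholds constrained equal across groups, the latent
# means are on a comparable metric; group 1's mean stays fixed at 0 as the
# reference, and group 2's mean is freed so it can be estimated and compared.
mgModel <- paste(freeModel, '
                ind60 ~ c(0, NA)*1')
mgFit <- sem(mgModel, data=newData, group="region",
             group.equal=c("loadings", "thresholds"))
summary(mgFit)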
