Solved – Troubleshooting underidentification issues in structural equation modelling (SEM)

Tags: confirmatory-factor, r, structural-equation-modeling

Background

Introduction

I have a data set collected from a questionnaire that I wish to validate. I have chosen to use confirmatory factor analysis to analyse it.

Instrument

The instrument consists of 11 subscales with a total of 68 items. Each item is scored on an integer scale from 1 to 4.

Confirmatory factor analysis (CFA) setup

I use the sem package to conduct the CFA. My code is as follows:

library(sem)

# 68 x 68 sample covariance matrix; restore the row names lost by read.table()
cov.mat <- as.matrix(read.table("http://dl.dropbox.com/u/1445171/cov.mat.csv", sep = ",", header = TRUE))
rownames(cov.mat) <- colnames(cov.mat)

# reference.indicators = FALSE fixes the factor variances to 1 for scaling
model <- cfa(file = "http://dl.dropbox.com/u/1445171/cfa.model.txt", reference.indicators = FALSE)
cfa.output <- sem(model, cov.mat, N = 900, maxiter = 80000, optimizer = optimizerOptim)

Warning message:
In eval(expr, envir, enclos) : Negative parameter variances.
Model may be underidentified.

Straight off you might notice a few anomalies; let me explain.

  • Why is the optimizer chosen to be optimizerOptim?

ANS: I originally stuck with the default optimizerSem, but no matter how many iterations I ran, either I ran out of memory first (on an 8 GB RAM setup) or it reported no convergence. Things "seemed" a little better when I switched to optimizerOptim, whereby it would conclude successfully but throw the warning that the model is underidentified. On closer inspection, I realised that the output shows convergence as TRUE but iterations as NA, so I am not sure what is actually happening (see the sketch after this list).

  • The maxiter is too high.

ANS: If I set it to a lower value, it refuses to converge, although, as mentioned above, I doubt real convergence ever occurred.
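
A minimal sketch of how to inspect what the optimizer reported, assuming the fitted object exposes the convergence and iterations components that summary() prints (component names may differ across sem versions):

cfa.output$convergence   # TRUE here, according to the output
cfa.output$iterations    # NA here, so the iteration count was never filled in
summary(cfa.output)      # full estimates, including any negative variances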

Problem

By now I guessed that the model really is underidentified, so I looked for resources on resolving the problem and found:

I followed the 2nd link quite closely and applied the t-rule:

  • I have 68 observed variables, providing 68 variances and 2278 covariances between variables, i.e. 68 × 69 / 2 = 2346 data points.
  • I have 68 regression coefficients (loadings), 68 error variances, 11 factor variances and 55 factor covariances to estimate, a total of 202 parameters.
  • Since I will be fixing the variances of the 11 latent factors to 1 for scaling, I remove them from the parameters to estimate, leaving 191 free parameters.
  • My degrees of freedom are therefore 2346 − 191 = 2155, making the model over-identified by the t-rule (the arithmetic is sketched after this list).
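
For concreteness, the t-rule arithmetic as a small R sketch (just the numbers above, nothing package-specific):

p <- 68                         # observed variables
data.points <- p * (p + 1) / 2  # 68 variances + 2278 covariances = 2346

loadings <- 68                  # one loading per item
errors   <- 68                  # one error variance per item
f.covs   <- 11 * 10 / 2         # 55 factor covariances; the 11 factor
                                # variances are fixed to 1, so not counted

free.params <- loadings + errors + f.covs  # 191
data.points - free.params                  # df = 2155 > 0, over-identified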

Questions

  1. Is the low variance of some of my items a possible cause of the underidentification? I asked a previous question, Confirmatory factor analysis using SEM: What do we do with items with zero variance?, which led me to think about items whose variances are very close to zero. Should those be removed too?
  2. After reading much, I surmise that the underidentification might be a case of empirical underidentification. Is there a systematic way of diagnosing which kind of underidentification it is, and what are my options for proceeding with my analysis? (A quick check is sketched below.)
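
One quick, informal check bearing on both questions, run on the posted covariance matrix (the cutoffs you would apply are a matter of judgment, not established rules):

sort(diag(cov.mat))[1:10]  # the ten smallest item variances (Question 1)

# Near-linear dependencies among items are a classic source of empirical
# underidentification; eigenvalues of the sample covariance matrix close
# to zero are a warning sign (Question 2)
ev <- eigen(cov.mat, symmetric = TRUE, only.values = TRUE)$values
min(ev)            # near zero => some items are nearly redundant
max(ev) / min(ev)  # condition number; very large => ill-conditioned matrix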

I have more questions, but let's stop at these two for now. Thanks for any help!

Best Answer

Ken Bollen and I wrote about negative variance estimates (a.k.a. Heywood cases); you might want to take a look for some insights. For a model this huge, God only knows how misspecifications will show up, but in my experience Heywood cases are a typical outlet for the model to let off steam when something is not fitting right.

That having been said, I would try different diagnostics: first, fit each of the subscale submodels, with their 6 or so indicators, separately, and see whether anything is wrong with them (a sketch follows below). In the CFA context, I would expect underidentification to arise only if some variables have zero coefficient/covariance with the factors they are supposed to measure; you should be able to catch that with the analysis of subscales.
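
A minimal sketch of that subscale-level check, using cfa()'s text argument (available in newer versions of the sem package) and hypothetical item names V1 to V6 standing in for one subscale's actual items:

library(sem)

# One-factor CFA for a single subscale; item names are placeholders
sub1.items <- c("V1", "V2", "V3", "V4", "V5", "V6")
sub1.model <- cfa(text = "F1: V1, V2, V3, V4, V5, V6",
                  reference.indicators = FALSE)  # factor variance fixed to 1
sub1.fit <- sem(sub1.model, cov.mat[sub1.items, sub1.items], N = 900)
summary(sub1.fit)  # watch for Heywood cases or near-zero loadings

Repeating this over the 11 subscales should localise which set of items, if any, is causing the trouble.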

Finally, for Likert scales with 4 categories, you really should use polychoric correlations (the polycor package). For one thing, the categorical nature of the data renders the likelihood ratio tests unreliable (as if I would trust 900 observations to give rise to 2155 independent degrees of freedom, anyway).
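
A sketch of that substitution, assuming the raw item responses sit in a data frame items (only the covariance matrix was posted, so the name is illustrative):

library(polycor)
library(sem)

# hetcor() computes polychoric correlations for ordered factors, so coerce
# the 1-4 integer items first; std.err = FALSE keeps it fast for 68 items
items.ord <- as.data.frame(lapply(items, ordered))
het <- hetcor(items.ord, ML = FALSE, std.err = FALSE)
poly.mat <- het$correlations

# Refit the same CFA model to the polychoric matrix; ML standard errors and
# likelihood ratio tests remain questionable for ordinal data, as noted above
cfa.poly <- sem(model, poly.mat, N = 900)
summary(cfa.poly)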