Estimated covariance matrix and sample covariance matrix of SEM

confirmatory-factor, covariance-matrix, structural-equation-modeling

Normal Covariance

I have looked high and low for an answer to this question, but I have never found a satisfying one. First, I think I understand what a covariance matrix is. It should look something like the following:

# Sample data frame:
  x   y    z
1 1 700 6000
2 2 211 3222
3 5 453 1444
4 7 213 3222
5 5 564 6754
6 3 212 4567
7 2 234 3241
8 8 564 5679

# Covariance Matrix:
           x            y            z
x   6.410714     45.16071     117.2679
y  45.160714  39984.41071  216644.3036
z 117.267857 216644.30357 3211178.6964

In the covariance matrix, each diagonal entry is simply the variance of x, y, or z, while the off-diagonal entries show how much each pair of variables covaries, the formula being:

$cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
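As a quick check, the covariance matrix above can be recomputed from the sample data. This is a minimal sketch in plain Python, assuming the usual sample covariance with an $n - 1$ denominator; the names `sample_cov` and `cov_matrix` are only illustrative.

```python
# Sample data from the question (8 observations of x, y, z).
data = {
    "x": [1, 2, 5, 7, 5, 3, 2, 8],
    "y": [700, 211, 453, 213, 564, 212, 234, 564],
    "z": [6000, 3222, 1444, 3222, 6754, 4567, 3241, 5679],
}


def sample_cov(a, b):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


names = list(data)
# Entry [r][c] is the covariance of variable r with variable c;
# the diagonal holds each variable's variance.
cov_matrix = [[sample_cov(data[r], data[c]) for c in names] for r in names]

for name, row in zip(names, cov_matrix):
    print(name, [round(v, 4) for v in row])
```

The printed rows should match the matrix in the question, which suggests it was computed with the $n - 1$ denominator (the default of R's `cov`).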

SEM Models

I believe I am also pretty clear on how latent variables are formed in CFA/SEM. The basic formula relating each manifest variable to its latent variable, factor loading, and error term is the following:

[Image: measurement-model equation]

With this being the covariance matrix of each latent variable:

[Image: covariance matrix of each latent variable]

And as an extension of that, I believe this is the covariance of all latent variables in the SEM model:

[Image: covariance matrix of all latent variables]

However, I am still not clear on how this works with estimation in SEM.

Estimated Matrix and Model Fit for SEM

What I am still totally confused about is how Structural Equation Modeling (SEM) estimates a covariance matrix, and how the model fit indices are computed from it. As far as I understand, the estimated matrix is somehow compared to the actual sample covariance matrix, and that comparison produces the estimates of model fit, but I'm still not entirely sure how that works.

Any answers would be greatly appreciated and I'm sorry if this question has been asked a zillion times. I tried looking for this specific way of wording my question and didn't find anything. Please keep in mind that I'm not amazing at mathematical notation, so the simplest possible answer is preferred.

Best Answer

I'm not sure I understand all of your notation.

You have three latent variables, so why does the covariance matrix of latent variables have 9 rows and columns? (And what is $\gamma$?) (And what are $\phi$ and $\theta$ - different authors use symbols differently, so it's best to define what you mean).

You have $S$, your sample covariance matrix.

You have a model-implied covariance matrix $\Sigma$ (or sometimes $\Sigma(\theta)$).

You have a model - the model implies $\Sigma(\theta)$.

In a CFA model (such as you have):

$\Sigma(\theta) = \Lambda\Phi\Lambda' + \delta$

Where $\Lambda$ is the matrix of loadings, $\delta$ is the matrix of error variances (E in your case, which is a diagonal matrix), and $\Phi$ is the covariance matrix of the latent variables (F). If you fix the latent variances to 1 and the latent covariances to zero, then $\Phi$ is an identity matrix and you can ignore it.

If you don't want to think about it in terms of matrix algebra, the implied covariance between two items is the product of the paths between them. So the covariance of $V1$ and $V2$ (both indicators of $F1$) is $\lambda_1 \times \lambda_2$, and the covariance of $V1$ and $V4$ (an indicator of $F2$) is $\lambda_1 \times cov(F1, F2) \times \lambda_4$. (I'm assuming that all latent variable variances are equal to 1.00, because that makes things easier.) This is what the equation showing $\Sigma(\gamma)$ is doing - but it's rather hard to see without looking at the path diagram at the same time.
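As a rough check on the path-tracing rule, the sketch below builds $\Sigma(\theta)$ from the matrix formula for a hypothetical two-factor model with four indicators, where V1 and V2 load on F1, and V3 and V4 load on F2. All numeric values are made up for illustration, and `matmul` and `transpose` are small local helpers rather than library functions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


# Hypothetical parameter values, chosen only for illustration.
lam = [0.8, 0.7, 0.6, 0.9]           # loadings of V1..V4
Lambda = [
    [lam[0], 0.0],                   # V1 loads on F1
    [lam[1], 0.0],                   # V2 loads on F1
    [0.0, lam[2]],                   # V3 loads on F2
    [0.0, lam[3]],                   # V4 loads on F2
]
phi_12 = 0.5                         # cov(F1, F2); latent variances fixed to 1
Phi = [[1.0, phi_12], [phi_12, 1.0]]
error_var = [0.36, 0.51, 0.64, 0.19]  # diagonal of the error matrix

# Sigma(theta) = Lambda * Phi * Lambda' + diagonal error matrix
common = matmul(matmul(Lambda, Phi), transpose(Lambda))
Sigma = [[common[i][j] + (error_var[i] if i == j else 0.0)
          for j in range(4)] for i in range(4)]

# Compare matrix entries with the path-tracing products.
print("cov(V1, V2):", Sigma[0][1], "vs", lam[0] * lam[1])
print("cov(V1, V4):", Sigma[0][3], "vs", lam[0] * phi_12 * lam[3])
```

Within a factor, the implied covariance is the product of the two loadings; across factors, the latent covariance is multiplied in as well.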

So you find the values for the unknowns in the model that make $\Sigma(\theta)$ and $S$ as similar as possible. There are various ways to measure similarity, but the most common is maximum likelihood (ML).

Find the distance $F_{ML}$

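$F_{ML} = \log|\Sigma(\theta)| + tr(S\Sigma(\theta)^{-1}) - \log|S| - p$

Where $|\cdot|$ is the determinant, $tr$ is the trace, and $p$ is the number of observed variables in the model. If $\Sigma(\theta) = S$ then $S\Sigma(\theta)^{-1}$ is the identity matrix, its trace is $p$, the two log-determinants cancel, and $F_{ML} = 0$.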
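A minimal sketch of the discrepancy function, with the log terms taken as log-determinants of $\Sigma(\theta)$ and $S$. It is written in plain Python so that it runs without extra packages; the helper `logdet_and_inverse` and the 2 x 2 example matrices are assumptions made for illustration, and in practice a linear algebra library would do this work.

```python
import math


def logdet_and_inverse(m):
    """Gauss-Jordan elimination with partial pivoting.

    Returns (log of |det m|, inverse of m). Assumes m is square and
    positive definite, as a covariance matrix should be.
    """
    n = len(m)
    # Augment m with the identity matrix: [m | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        log_det += math.log(abs(p))      # |det| is the product of |pivots|
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return log_det, [row[n:] for row in aug]


def f_ml(S, Sigma):
    """F_ML = log|Sigma| + tr(S Sigma^-1) - log|S| - p."""
    p = len(S)
    log_det_sigma, sigma_inv = logdet_and_inverse(Sigma)
    log_det_s, _ = logdet_and_inverse(S)
    trace = sum(S[i][k] * sigma_inv[k][i] for i in range(p) for k in range(p))
    return log_det_sigma + trace - log_det_s - p


# Hypothetical 2 x 2 example: a sample matrix and a model that ignores
# the covariance between the two variables.
S = [[1.0, 0.5], [0.5, 1.0]]
Sigma_bad = [[1.0, 0.0], [0.0, 1.0]]

print("perfect fit:", f_ml(S, S))
print("misfit:     ", f_ml(S, Sigma_bad))
```

The first call should return a value of essentially zero, and the second a positive value, since the identity matrix fails to reproduce the covariance of 0.5.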

You can calculate the $\chi^2$ statistic using:

$\chi^2 = F_{ML} \times (N - 1)$ where $N$ is the sample size. (Note that $N-1$ is used by legacy software such as LISREL, EQS, and Amos, which date back to the use of Wishart likelihood for sampling variability of covariance matrices; more recent software like Mplus and lavaan use normal likelihood and multiply $F_{ML}$ by $N$.) Most of the other fit indices are derived in some way from these values (obvious exceptions are the GFI, which no one uses any more, and SRMR, which is based on the difference between $S$ and $\Sigma$).
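The two multiplier conventions can be compared with a small sketch. The function name `chi_square`, its `multiplier` argument, and the example numbers are hypothetical and not taken from any SEM package.

```python
def chi_square(f_ml_value, n, multiplier="n-1"):
    """Scale the minimized F_ML into a chi-square test statistic.

    multiplier="n-1" follows the Wishart-based convention;
    multiplier="n" follows the normal-likelihood convention.
    """
    if multiplier == "n-1":
        return f_ml_value * (n - 1)
    if multiplier == "n":
        return f_ml_value * n
    raise ValueError("multiplier must be 'n-1' or 'n'")


# Hypothetical minimized discrepancy and sample size.
f_min = 0.05
n_obs = 200

print("N - 1 convention:", chi_square(f_min, n_obs, "n-1"))
print("N convention:    ", chi_square(f_min, n_obs, "n"))
```

The two results differ by exactly $F_{ML}$, so the choice matters little unless the sample is small.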

You can play around with this using whatever program you like that can do matrix algebra and has an iterative solver. Quite some time ago, I wrote a paper showing how to do this with MS Excel, which you can find here: https://link.springer.com/content/pdf/10.3758/BF03192739.pdf. It's been a while since I used Excel, so I don't know if this still works, but here's the sheet.
