Solved – Path Analysis assumptions: endogenous variables cannot share error covariance

covariance, path-model, structural-equation-modeling

On this website: http://mb3is.megx.net/gustame/constrained-analyses/path-analysis

"If variables are believed to share extraneous variables, they should be considered as exogenous variables rather than endogenous variables."

Does path analysis really have such an assumption? If so, why?

I have this question because I would like to predict two outcomes (math at age 5, math at age 6) by four predictors in my path model, and I believe the two outcomes have shared residual covariance.

Best Answer

I think the author of that link is being a bit severe. The default, in structural equation modeling, is for residuals to be uncorrelated, but you can posit correlations between them if you think it makes sense to do so.
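As a minimal sketch of what positing such a correlation can look like in lavaan, here is a simulated example. The variable names (x1 to x4, math5, math6, tutoring) and all effect sizes are made up for illustration. `~~` is lavaan's covariance operator. As far as I know, lavaan's sem() frees covariances among pure outcome variables by default, so both models below state the residual covariance explicitly instead of relying on defaults.

```r
library(lavaan)

set.seed(42)
n <- 1000
# Four hypothetical measured predictors
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# An unmeasured influence shared by both outcomes (deliberately not modeled)
tutoring <- rnorm(n)
d$math5 <- 0.5 * d$x1 + 0.3 * d$x2 + 0.2 * d$x3 + 0.1 * d$x4 + tutoring + rnorm(n)
d$math6 <- 0.4 * d$x1 + 0.3 * d$x2 + 0.2 * d$x3 + 0.2 * d$x4 + tutoring + rnorm(n)

# Residual covariance between the two outcomes estimated freely
mod.free <- "
  math5 ~ x1 + x2 + x3 + x4
  math6 ~ x1 + x2 + x3 + x4
  math5 ~~ math6
"
# Same regressions, residual covariance fixed to zero
mod.zero <- "
  math5 ~ x1 + x2 + x3 + x4
  math6 ~ x1 + x2 + x3 + x4
  math5 ~~ 0 * math6
"
fit.free <- sem(mod.free, data = d)
fit.zero <- sem(mod.zero, data = d)

# Pull out the estimated residual covariance
pe <- parameterEstimates(fit.free)
res.cov <- pe[pe$lhs == "math5" & pe$op == "~~" & pe$rhs == "math6", ]
print(res.cov)

# Likelihood-ratio test of the zero-covariance restriction (1 df)
print(anova(fit.zero, fit.free))
```

With both outcomes regressed on all four predictors, the free model is saturated, so the fit of the zero-covariance model is itself a test of whether the residuals can be treated as uncorrelated.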

However, residual correlations will weaken the wow factor of your model. Presumably, you are saying that math at age 6 is related to math at age 5 and 4 predictors ... let's say gender, grade 1 teacher, parental homework involvement, hours of computer games. What would it mean if the residuals are dependent? How could that happen? One way is if there is another variable kicking around, say attendance at a tutorial program, which some kids have and others don't, and which influences math at both age 5 and age 6, but which you have not included in the model. This is the "extraneous variable" the author is talking about.
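The mechanism can be seen in a small base-R simulation. This is only a sketch: `tutoring` stands in for the hypothetical unmeasured program, `x` for a single measured predictor, and the coefficients are arbitrary.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)          # a measured predictor
tutoring <- rnorm(n)   # hypothetical common cause of both outcomes
math5 <- 0.5 * x + 0.8 * tutoring + rnorm(n)
math6 <- 0.5 * x + 0.8 * tutoring + rnorm(n)

# Common cause left out: the residuals of the two regressions stay correlated
r.omitted <- cor(resid(lm(math5 ~ x)), resid(lm(math6 ~ x)))

# Common cause included: the residual correlation should be close to zero
r.included <- cor(resid(lm(math5 ~ x + tutoring)),
                  resid(lm(math6 ~ x + tutoring)))

print(c(omitted = r.omitted, included = r.included))
```

The only thing that differs between the two residual correlations is whether the shared cause was in the regressions.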

So your research hypothesis is now: math at age 6 depends on math at age 5 plus 4 variables I have measured, plus a bunch of stuff I didn't think about. This is probably true in most educational research, but not the stuff of riveting publications.

You didn't specify exactly what your model is going to look like. Do the "four predictors" just feed into "math at age 5"? Or do they feed into both time points? If they represent realities that persist from age 5 to 6, you might want to feed them into both outcomes. For example, perhaps the grade 1 teacher is good because the school is good and the kindergarten teacher is also good, meaning that a school effect should feed into the outcome at age 5 and the outcome at age 6. If you just feed the 4 predictors into one of the math outcomes, you will get a non-zero residual covariance from the predictor effect that was not properly included in the model specification.

Related Solutions

Solved – Interpretation of Residual Covariance

It's a partial correlation. It represents covariance (or correlation) between the factors that is not explained by the predictors. It means that there are common causes that you have not included, or that the two factors are causally related.
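To make the "partial correlation" reading concrete, the sketch below (base R, simulated data, all names and coefficients hypothetical) computes the correlation between the residuals of two regressions and compares it with the partial correlation obtained from the inverse covariance matrix. The two quantities are algebraically the same.

```r
set.seed(2)
n <- 500
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))  # measured predictors
u <- rnorm(n)                             # unmeasured common cause
y1 <- drop(X %*% c(0.6, 0.2)) + u + rnorm(n)
y2 <- drop(X %*% c(0.3, 0.5)) + u + rnorm(n)

# Correlation between the residuals after regressing each outcome on X
r.resid <- cor(resid(lm(y1 ~ X)), resid(lm(y2 ~ X)))

# Partial correlation of y1 and y2 given X, from the precision matrix
P <- solve(cov(cbind(y1, y2, X)))
r.partial <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

print(c(residual = r.resid, partial = r.partial))
```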

There's no cutoff for statistical non-significance, other than the cutoff that you usually use (i.e. p < 0.05). You should probably leave it in, because (a) you don't care, (b) you're getting a degree of freedom for free if you take it out only when it's non-significant, and (c) if you take it out you are hypothesizing that you have included in your model every common cause of those two factors, and that seems unlikely.

Solved – multiple group model vs moderated regression

I don't know of a paper. It's the sort of thing that's pretty clear if you know about SEM, and that's why no one writes it. If anyone did write it and send it to a journal, reviewers would say "This is obvious."

The two methods can give identical results, if you add the correct constraints to the SEM. The SEM approach makes fewer assumptions about homogeneity of variance, so it might be preferable. (The variance of y can be different in the two groups in the SEM approach; it can't in the regression approach.)

Here's an example (using lavaan, in R). Everything in the regression can be seen in the lavaan output.

> library(lavaan)
> set.seed(1234)
> df <- data.frame(x = rnorm(1000))
> df$m <- c(rep(0, 500), rep(1, 500))
> df$y <- df$x + rnorm(1000) + df$m + df$m * df$x + rnorm(1000)
> 
> summary(lm(y ~ x + m + x * m, data = df))

Call:
lm(formula = y ~ x + m + x * m, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4406 -0.9659  0.0093  0.9167  4.4616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.08244    0.06193   1.331    0.183    
x            1.06196    0.05991  17.727   <2e-16 ***
m            0.92634    0.08766  10.568   <2e-16 ***
x:m          1.01805    0.08815  11.548   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.385 on 996 degrees of freedom
Multiple R-squared:  0.5902,    Adjusted R-squared:  0.5889 
F-statistic: 478.1 on 3 and 996 DF,  p-value: < 2.2e-16

> 
> mod.1 <- "
+   y ~ c(a, b) * x
+   y ~~ c(v1, v1) * y  # This step needed for exact equivalence
+   y ~ c(int1, int2) * 1
+ 
+   modEff := a - b
+   mEff := int1 - int2
+ "
> 
> fit.1 <- sem(mod.1, data = df,
+                 group = "m")
> summary(fit.1)
lavaan (0.5-18) converged normally after  15 iterations

  Number of observations per group         
  0                                                500
  1                                                500

  Estimator                                         ML
  Minimum Function Test Statistic                0.499
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.480

Chi-square for each group:

  0                                              0.244
  1                                              0.255

Parameter estimates:

  Information                                 Expected
  Standard Errors                             Standard

Group 1 [0]:

                   Estimate  Std.err  Z-value  P(>|z|)
Regressions:
  y ~
    x         (a)     1.062    0.060   17.762    0.000

Intercepts:
    y      (int1)     0.082    0.062    1.334    0.182

Variances:
    y        (v1)     1.910    0.085



Group 2 [1]:

                   Estimate  Std.err  Z-value  P(>|z|)
Regressions:
  y ~
    x         (b)     2.080    0.065   32.228    0.000

Intercepts:
    y      (int2)     1.009    0.062   16.294    0.000

Variances:
    y        (v1)     1.910    0.085


Defined parameters:
    modEff           -1.018    0.088  -11.572    0.000
    mEff             -0.926    0.087  -10.589    0.000

(I didn't actually do the final step of constraining the two slopes to be equal and comparing the models with the anova() function; I just used the defined difference. That's left as an exercise for the reader, but the result will be essentially the same p-value, without a parameter estimate and standard error.)
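One way the constrained-model comparison could be done is sketched below. It reuses the simulated data from the example; the names `mod.free`, `mod.equal`, `fit.free`, `fit.equal`, and `lrt` are my own and not from the original answer. The interaction is tested by giving both groups the same slope label and comparing fits. Note that this is a likelihood-ratio test, so its p-value need not match the Wald z-test digit for digit, though with this sample size the two should agree closely.

```r
library(lavaan)

# Same simulated data as in the example above
set.seed(1234)
df <- data.frame(x = rnorm(1000))
df$m <- c(rep(0, 500), rep(1, 500))
df$y <- df$x + rnorm(1000) + df$m + df$m * df$x + rnorm(1000)

# Slopes free to differ across groups (labels a and b)
mod.free <- "
  y ~ c(a, b) * x
  y ~~ c(v1, v1) * y
  y ~ c(int1, int2) * 1
"
# Slopes constrained equal across groups (same label a twice)
mod.equal <- "
  y ~ c(a, a) * x
  y ~~ c(v1, v1) * y
  y ~ c(int1, int2) * 1
"
fit.free  <- sem(mod.free,  data = df, group = "m")
fit.equal <- sem(mod.equal, data = df, group = "m")

# 1-df chi-square difference test of the x-by-m interaction
lrt <- anova(fit.equal, fit.free)
print(lrt)
```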

Chi-square gives a p-value. If you don't like the p-value from the chi-square, you don't like the p-value from the regression model. It's the same estimate of the interaction effect, same standard error, same p-value, whichever method you use.

Not really. But you can, and it's easy to relax the assumption. In regression you make the assumption and you're stuck with it. In SEM the assumption is automatically tested. I added the constraint to the above model, and so it has 1 df. The chi-square test is not significant, so I don't have evidence that the assumption was violated. But you don't need to put in that constraint, so I suspect most people wouldn't.
