Regression – Comparison of Regression in SEM Programs vs SPSS

regression, structural-equation-modeling

Context

Multiple regression with continuous variables.

Conventional statistical packages, e.g. SPSS or lm() in R, typically give me an F value, degrees of freedom, and a significance test of whether the model as a whole performs well. I can also compare two nested models with an ANOVA to find out whether the increase in R square is significant.
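For instance, a minimal R sketch of that conventional workflow (the data and variable names are made up purely for illustration):

```r
# Simulated data, purely for illustration
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit1 <- lm(y ~ x1)        # reduced model
fit2 <- lm(y ~ x1 + x2)   # full model

summary(fit2)      # F value, dfs, and overall significance test
anova(fit1, fit2)  # F test on the increase in R square
```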

Structural equation modeling programs, e.g. Mplus or the lavaan package in R, also let me run the regression, with the added benefit of handling missing data through full information maximum likelihood (FIML). Even though the parameter estimates and R squares are similar between the two approaches, I do not get an F value or a significance test for the model when I do the regression SEM-style. Instead, I get model fit information with df = 0, and I cannot figure out how to test whether one regression model is better than another.
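Here is roughly what the SEM-style run looks like in lavaan (a sketch, reusing the simulated data from above; missing = "fiml" only becomes relevant once the data actually contain NAs):

```r
library(lavaan)

dat <- data.frame(y, x1, x2)  # the simulated data from above

fit_sem <- sem('y ~ x1 + x2', data = dat,
               missing = "fiml", fixed.x = FALSE)

summary(fit_sem, rsquare = TRUE)  # similar estimates and R square,
                                  # but the chi-square test has df = 0
```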

Questions

  1. What are the differences between these two approaches, if any?
  2. In SEM style, how do I know a regression model is good, and how can I compare whether one regression model has a significantly larger R square than another?

Best Answer

I think doing regression via SEM is bogus. I mean, it is cute to show that you can express a linear regression as a special case of SEM, just to demonstrate how general SEMs are, but actually doing regression with an SEM is a waste of time, as the approach does not utilize the many advances in regression modeling that are specific to linear models. This is a right-tool-for-the-job issue: if nothing else were at hand, I could hammer a nail into drywall with a screwdriver by holding it by the sharp end and hitting the nail with the handle, but I would not recommend that in general.

In SEM, you model the covariance matrix of everything: the regressors and the dependent variable. The covariance matrix of the regressors has to be unconstrained. The covariances between the dependent variable and the regressors are what generate the coefficient estimates, and the variance of the dependent variable gives the $s^2$. So you have used up all the degrees of freedom (the number of distinct covariance matrix entries), and that is why you see a zero. You should still be able to find $R^2$ in your output, but it will be buried somewhere rather than thrown at you as in regression output: from the point of view of SEM, your dependent variable is nothing special; you may have a few dozen regressions in your model and can get all of their $R^2$s, or reliabilities, somewhere, but you may have to ask for them with TECH options in Mplus.
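To make that degrees-of-freedom bookkeeping concrete, here is the count for a model with $k$ regressors and no mean structure. The sample covariance matrix of the $k+1$ observed variables has $(k+1)(k+2)/2$ distinct entries, and the model spends exactly that many parameters:

$$\underbrace{\frac{k(k+1)}{2}}_{\text{regressor (co)variances}} + \underbrace{k}_{\text{slopes}} + \underbrace{1}_{\text{residual variance}} = \frac{(k+1)(k+2)}{2},$$

so df $= 0$: the model is saturated, and the chi-square test of fit has nothing left to test.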

The missing value handling is even greater bogus. Typically, you have to assume some sort of distribution for all of the variables, such as multivariate normal, to run full information maximum likelihood. This is very doubtful in most applications, e.g. when you have dummy explanatory variables.
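To see where the assumption sneaks in, here is a sketch (with made-up data) of what FIML commits you to in lavaan: pulling the regressors into the joint normal likelihood, dummies included.

```r
library(lavaan)

# Made-up data with a 0/1 dummy regressor and some missing outcomes
set.seed(2)
n   <- 200
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.5)               # dummy regressor
y   <- 0.5 * x1 + 0.3 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)
dat$y[sample(n, 40)] <- NA             # hypothetical missingness

# Once the regressors are modeled (fixed.x = FALSE), FIML treats
# (y, x1, x2) as jointly multivariate normal: exactly the assumption
# that is hard to defend for the 0/1 dummy x2
fit_fiml <- sem('y ~ x1 + x2', data = dat,
                missing = "fiml", fixed.x = FALSE)
summary(fit_fiml)
```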

The advantage of doing regression properly in R or Stata is that you will have access to all the traditional diagnostics (residuals, leverage, and other influence measures; collinearity, nonlinearity, and other goodness-of-fit issues), as well as additional tools for better inference (sandwich estimators that are robust to heteroskedasticity, cluster correlation, or autocorrelation). SEM can offer "robust" standard errors, too, but they do not work well when the model is structurally misspecified (which is what one of my papers was about).
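A sketch of that toolbox in R, assuming the standard sandwich, lmtest, and car packages (the data are made up):

```r
library(sandwich)  # robust (sandwich) covariance estimators
library(lmtest)    # coeftest() for inference with a custom vcov
library(car)       # vif() for collinearity

# Made-up data with heteroskedastic errors
set.seed(3)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 0.5 * x1 + 0.3 * x2 + rnorm(n) * (1 + abs(x1))
fit <- lm(y ~ x1 + x2)

# Traditional diagnostics
plot(fit)              # residual, Q-Q, scale-location, leverage plots
vif(fit)               # collinearity
cooks.distance(fit)    # influence

# Heteroskedasticity-robust inference; vcovCL() and NeweyWest()
# cover cluster correlation and autocorrelation analogously
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))
```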