Solved – Whether to leave the data unaltered in the face of outliers and non-normality when performing structural equation modelling

factor-analysis, normality-assumption, outliers, structural-equation-modeling

I recently received this email from a graduate student, and I get similar questions often enough that I thought I'd post it here:

I'm using factor analysis, multiple regression, and SEM and currently
checking statistical assumptions. I have found numerous univariate
and multivariate outliers. If I deleted them all, it would mean a
large chunk out of my sample size ($N \approx 350$). I also have
problems with non-normality, non-linearity, heteroscedasticity
(Multiple regression), and large standardised residual covariances
(SEM).

I have tried reducing the influence of the outliers (allocating them a
value one unit larger/smaller than the next most extreme non-outlier
value), and transformations (mostly the variables remained skewed and
some outliers remain). When I compare original results with altered
data, there is little effect. Given this, I am wondering whether it
would be acceptable to leave the data as it is? I'm inclined to,
particularly because this data is from a non-clinical population and I
have used clinical measures.

Best Answer

A lot depends on where exactly the outliers occur within the model -- in the indicators? in the latent variables and their measurement errors? in the exogenous variables at the top of the causal chain? In the latter case, you cannot do much, as you really have high-leverage influential cases rather than outliers. To control for outliers in the indicators/response variables, you need to work at the equation level, as Moustaki and Victoria-Feser (2006) did.

Shooting at the problem with robust covariance matrices may or may not be the right thing to do. I am referring here to the recent work by Ke-Hai Yuan and Zhiyong Zhang of Notre Dame, who have tried to revive robust estimation methods as applied to structural equation modeling -- see, e.g., their R package rsem (which seems to rely on EQS as the estimation engine, though, which is odd given the variety of choices within R). They have been publishing like crazy on this over the past five or so years; I have reviewed at least three of their papers for various journals, and frankly I am at a loss as to which one to recommend, as they all repeat each other. I have not seen this used much in applied work, although it probably should be; maybe you'd be the trendsetter!
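To make the robust-covariance idea concrete, here is a minimal sketch in R: estimate a robust covariance matrix and fit the SEM to it instead of the ordinary sample covariance. This is a generic illustration using MASS::cov.rob and lavaan, not rsem's own API; the data and model are entirely hypothetical, and the standard errors and fit statistics would still need the kind of corrections that rsem supplies.

```r
# Minimal sketch: fit a SEM to a robustly estimated covariance matrix.
# Generic illustration only -- rsem wraps a similar workflow with
# Huber-type weights and proper corrections. Data/model are hypothetical.
library(MASS)    # cov.rob(): robust location and scatter (MCD)
library(lavaan)  # sem(): structural equation modeling

set.seed(1)
n <- 350
eta1 <- rnorm(n)
eta2 <- 0.5 * eta1 + rnorm(n)
dat <- data.frame(x1 = eta1 + rnorm(n), x2 = eta1 + rnorm(n),
                  x3 = eta1 + rnorm(n), y1 = eta2 + rnorm(n),
                  y2 = eta2 + rnorm(n), y3 = eta2 + rnorm(n))
dat[1:5, ] <- dat[1:5, ] + 10   # contaminate a few cases with gross outliers

rob <- cov.rob(dat, method = "mcd")  # robust covariance matrix

model <- '
  eta1 =~ x1 + x2 + x3
  eta2 =~ y1 + y2 + y3
  eta2 ~ eta1
'

# Fit to the robust covariance instead of the sample covariance.
# Caveat: SEs and fit statistics are not adjusted for the robust
# weighting here; that is precisely what rsem adds on top.
fit <- sem(model, sample.cov = rob$cov, sample.nobs = nrow(dat))
summary(fit, fit.measures = TRUE)
```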

A great diagnostic tool is the forward search method developed by Atkinson and Riani of LSE (for regression and multivariate data). It has been adapted to SEM here and here. I personally think this is really neat, but whether it will catch on in the SEM community at large, I don't know.
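For intuition, here is a toy R implementation of the multivariate forward search under my own simplifying assumptions -- it is not Atkinson and Riani's code, and their packaged implementations choose the initial subset far more carefully. The idea: start from a small, presumably clean subset, grow it one unit at a time, and monitor Mahalanobis distances; outliers enter last and show up as a spike at the end of the trajectory.

```r
# Toy sketch of the multivariate forward search (not the authors' code).
forward_search <- function(X, m0 = ncol(X) + 1) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Crude initial subset: points closest to the coordinatewise medians
  d0 <- mahalanobis(X, apply(X, 2, median), cov(X))
  subset <- order(d0)[1:m0]
  entering <- numeric(n - m0)
  for (m in m0:(n - 1)) {
    mu <- colMeans(X[subset, , drop = FALSE])
    S  <- cov(X[subset, , drop = FALSE])
    d  <- mahalanobis(X, mu, S)
    subset <- order(d)[1:(m + 1)]                 # next, larger subset
    entering[m - m0 + 1] <- sqrt(max(d[subset]))  # distance of entering unit
  }
  entering
}

# Hypothetical data: 100 clean trivariate cases plus 5 gross outliers
set.seed(2)
X <- rbind(matrix(rnorm(300), ncol = 3),
           matrix(rnorm(15, mean = 6), ncol = 3))
plot(forward_search(X), type = "l",
     xlab = "step of the search", ylab = "distance of entering unit")
# The sharp rise at the final steps flags the 5 contaminated cases.
```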

Frontiers in Quant Psy published a review paper on this in early 2012. Even though I am the acknowledged reviewer of this work, I am extremely reluctant to recommend it (it barely passed my threshold of publishable work, and I simply gave up trying to explain the theory of robust statistics in my referee letters), but I am just not aware of anything better.