Solved – EFA on one part of the dataset and CFA/SEM on another part of the dataset

confirmatory-factor, cross-validation, factor-analysis, structural-equation-modeling

Assume I split my dataset (n = 650) so that I can perform exploratory factor analysis (EFA) on one half and then confirm the extracted factor structure with confirmatory factor analysis (CFA) on the other half. If I want to perform further analysis after the CFA (e.g., mediation/moderation or further structural equation modelling), are there any recommendations regarding which dataset to use? Would I be limited to the dataset used for the CFA, or could I use my entire dataset?

There have been a number of previous CrossValidated questions in this vein [1, 2, 3, 4], but they all seem to have stopped after CFA.

Best Answer

I believe you should do the structural equation modeling on the second half of the dataset.

As you say in your question, the basic process is this: you split the dataset and run the EFA on the first half. This is where you explore the data and get a feel for how the structure shapes up. But who knows whether that structure is just an artifact of the particular data you had? So you move on to the second half of the dataset, specify the structure you got from the EFA, and see whether it fits these other data well (i.e., you do a CFA).
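As a minimal sketch of the split itself, assuming the 650 rows are indexed 0 to 649 and a simple random split-half is wanted (the function name `split_half` and the seed value are made up for illustration):

```python
import random

def split_half(n_rows, seed=2024):
    """Randomly partition row indices 0..n_rows-1 into two disjoint halves."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    cut = n_rows // 2
    efa_rows = sorted(indices[:cut])       # exploration half (EFA)
    confirm_rows = sorted(indices[cut:])   # held-out half (CFA / SEM)
    return efa_rows, confirm_rows

efa_rows, confirm_rows = split_half(650)
print(len(efa_rows), len(confirm_rows))    # 325 325
```

Fixing the seed matters here: the held-out half only works as a check if it stays the same rows every time you rerun the analysis.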

Now, this is as far as most people go, because they are interested only in investigating the psychometric properties of a scale (factor). I see this all the time in papers on scale validation.

But you are also interested in relationships between the factor and other variables. I think you should do structural equation modeling in the second half of the data; CFA is really just a part of fitting a SEM.

In any SEM, you are first going to specify a measurement model (i.e., do a CFA) to make sure that the latent variables fit before going on to model the relationships between them. It's like building from the ground up: first, let's make sure that our latent variables are solid building blocks before trying to build a model out of those blocks. So, in a sense, if you do the SEM, you are doing a CFA at the same time, just with a little more added on top.
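To make the "CFA is part of a SEM" point concrete, here is one common way to write the two pieces. The notation (all-$y$, LISREL-style symbols, with intercepts omitted) is my own choice for illustration, not something taken from the question:

$$
\begin{aligned}
y &= \Lambda \eta + \varepsilon && \text{(measurement model)} \\
\eta &= B \eta + \zeta && \text{(structural model)}
\end{aligned}
$$

Here $y$ holds the observed indicators, $\eta$ the latent variables, $\Lambda$ the factor loadings, $B$ the directed paths among the latent variables, and $\varepsilon$ and $\zeta$ the residuals. A CFA is the special case $B = 0$, in which the latent variables are only allowed to covary. Moving to a SEM keeps the first equation as it is and frees some entries of $B$.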

So with the second half of the dataset, I would write it up as such:

  1. Specify and run the entire model.
  2. When writing it up, first focus on the measurement model of the factor you are interested in, i.e., the one you ran the EFA on in the first half of the dataset. You could even report fit statistics for just this part of the model.
  3. Then, talk about the structural relationships between this latent variable and other variables.
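The steps above can be sketched as a model description in which the measurement part and the structural part are kept as separate pieces of one specification. This is only an illustration: the variable names (`F`, `x1` to `x4`, `M`, `Y`) are hypothetical, the operators follow lavaan-style model syntax (`=~` for "is measured by", `~` for "is regressed on"), and whether a particular SEM package accepts the string unchanged is an assumption you would need to check.

```python
# Hypothetical names throughout: items x1..x4 load on factor F (the structure
# found in the EFA half); M is a mediator and Y an outcome, both treated as
# observed variables for simplicity.
measurement_part = "F =~ x1 + x2 + x3 + x4"
structural_part = "\n".join([
    "M ~ F",       # factor predicts the mediator
    "Y ~ M + F",   # outcome regressed on mediator and factor
])

def build_model(measurement, structural=""):
    """Join measurement and structural lines into one model description."""
    return "\n".join(part for part in (measurement, structural) if part)

cfa_only = build_model(measurement_part)                   # measurement model alone
full_sem = build_model(measurement_part, structural_part)  # CFA plus structural paths

print(full_sem)
```

Note that `full_sem` contains `cfa_only` verbatim as its first line, which is the whole point: fitting the full model on the second half already includes the CFA, and fitting `cfa_only` on the same half is how you could report fit statistics for just the measurement part.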

Basically, you can do an EFA in the first half and a SEM in the second half, but give special attention to the measurement model part (i.e., the CFA) for the factor of interest, because doing this CFA is part of the structural equation model itself. There's no need to think of it as EFA $\rightarrow$ CFA $\rightarrow$ SEM. Think of CFA as part of specifying a SEM.