Solved – Use of factor analysis + regression

confirmatory-factor, factor analysis, regression

Independent Variable: I have a survey of 50 states indicating the amount of control the state board of education has in 31 areas, answered on a three-point scale (1 = total control; 2 = partial control; 3 = no control). I have a solid theoretical underpinning for all 31, or to be more precise, the literature review found evidence for all 31 as being important (Study X found items 1, 4 and 7; Study Y found items 2, 9, and 11, etc.).

Dependent variable: % of students graduating HS within 4 years.

What I would like to do is the following:

1. Use factor analysis (SPSS) to reduce the 31 down to no more than 4 to 6 variables.
2. Using the results from 1), run a regression vs. the % of students graduating HS within 4 years.

Is there some sort of step by step guide somewhere on how to do this?

Thanks.

UPDATE

Ok, this is what I have done and I believe it is correct (any confirmation would be greatly appreciated) using SPSS 18:

In SPSS Analyze -> Dimension Reduction -> Factor
Descriptives: Initial Solution
Extraction: Method = Principal components; Analyze = Correlation matrix; Display = Unrotated factor solution and Scree plot; Extract: Based on Eigenvalue greater than 1; Maximum Iterations for Convergence = 25
Rotation: Method = Varimax; Display = Rotated Solution and Loading Plots; Maximum Iterations for Convergence = 25
Scores: Save as variables; Method = Regression; Display factor score coefficient matrix
Options: Exclude cases listwise; Suppress small coefficients [with] absolute value below .10

The result is 9 saved columns (FAC1_1, FAC1_2, FAC1_3…FAC1_9) in the SPSS sheet.

The Total Variance Explained -> Rotation Sums of Squared Loadings indicates that the first 5 of these explain 51.51% of the variance.
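For reference, the percentages in that table come from simple arithmetic: when the correlation matrix is analyzed, each of the 31 items contributes one unit of variance, so a component's share is its sum of squared loadings divided by 31, and 51.51% corresponds to roughly 16 of the 31 units. Below is a minimal Python sketch of that bookkeeping. The eigenvalues are made up for a hypothetical 6-item survey (they are not the values from this analysis), and the function names are illustrative only.

```python
def kaiser_retained(eigenvalues):
    """Return the eigenvalues kept by the 'eigenvalue greater than 1' rule."""
    return [ev for ev in eigenvalues if ev > 1.0]


def percent_variance(sums_of_squares, n_items):
    """Percent of total variance per component when items are standardized.

    With a correlation matrix the total variance equals the number of items.
    """
    return [100.0 * ss / n_items for ss in sums_of_squares]


# Made-up eigenvalues for a hypothetical 6-item correlation matrix (they sum to 6).
eigenvalues = [2.4, 1.5, 0.9, 0.6, 0.4, 0.2]

retained = kaiser_retained(eigenvalues)
shares = percent_variance(retained, n_items=len(eigenvalues))
cumulative = sum(shares)

for ev, share in zip(retained, shares):
    print(f"eigenvalue {ev:.2f} -> {share:.1f}% of total variance")
print(f"retained components together: {cumulative:.1f}%")
```

An orthogonal rotation such as varimax redistributes variance among the retained components, so the per-component percentages change after rotation, but the total across the retained components stays the same.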


So, should I then go back into SPSS and run a linear regression (Analyze -> Regression -> Linear) with the Dependent Variable being % of students graduating HS within 4 years and the Independent Variables being FAC1_1, FAC1_2, FAC1_3, FAC1_4, and FAC1_5?

Best Answer

The problem that I see with your question is as follows:

31 is not a VERY large number of variables, at least not so large that you could not cluster similar variables by hand into 4 or 5 latent variables using sum-scores, as you aim to do. This should give roughly similar results to the factor analysis. If it doesn't, I would trust the by-hand scores more. The benefits of doing this are:

Scoring is done by nature of the research question, not the structure of the collected data.
The usual assumptions and very large "p" of data mining hardly apply here so the structure of the data is dubious to begin with. I am not confident that a number of "orthogonal" components would summarize something that school board educators would be interested in.
0 reproducibility error. Very easy to replicate and understand results. Could potentially benchmark and compare results between districts.
People reviewing such an analysis will agree that, while the measure may not be perfect, it should have good power to go about conducting a confirmatory factor analysis.
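A rough Python sketch of the by-hand approach: items are grouped into domains, each state's sum-score per domain is computed, and graduation rate is regressed on those scores. The domain names, item groupings, responses and graduation rates below are all made up for illustration. `sum_scores` and `ols` are hypothetical helpers written in plain Python so the example is self-contained; in practice a statistics package would run the regression and report standard errors.

```python
def sum_scores(responses, domains):
    """Sum each respondent's answers over the items assigned to each domain.

    responses: one list of item answers per state (1 = total control ... 3 = no control)
    domains:   dict mapping a domain name to the 0-based item indices it contains
    """
    return {name: [sum(row[i] for i in items) for row in responses]
            for name, items in domains.items()}


def ols(X, y):
    """Ordinary least squares via the normal equations; an intercept is added."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta


# Made-up data: 6 states, 4 items, 2 hypothetical domains.
responses = [
    [1, 1, 1, 2],
    [1, 2, 3, 3],
    [2, 2, 1, 1],
    [2, 3, 2, 3],
    [3, 3, 1, 2],
    [3, 2, 3, 2],
]
domains = {"curriculum": [0, 1], "finance": [2, 3]}
grad_rate = [88.0, 79.5, 84.0, 77.0, 75.5, 78.0]  # made-up % graduating in 4 years

scores = sum_scores(responses, domains)
X = list(zip(scores["curriculum"], scores["finance"]))
intercept, b_curriculum, b_finance = ols(X, grad_rate)
print(f"intercept={intercept:.2f}, curriculum={b_curriculum:.2f}, finance={b_finance:.2f}")
```

Because the grouping is fixed in advance from the literature rather than estimated from the data, anyone with the item list can recompute exactly the same scores.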

I am not advocating that you shouldn't inspect, say, a hierarchical clustering and/or heatmap, or use other analyses to show the interdependence of variables, nor that you shouldn't try to, say, run a univariate factor analysis and create latent variable scores to use as independent predictors in a regression model (note that the standard errors here aren't correct because they don't account for uncertainty in the scores). These types of analyses can help to better understand the confirmatory analysis above.

Related Solutions

Solved – “Two stage” factor Analysis: factoring saved factor scores

Of course, it's "possible" to do what you're asking. The question is whether or not this is the best way to deal with the issue. You have left out mention of a number of important considerations: first, did you rotate a PCA to create a CFA with 3 factors? That you've noted "cfa" as a keyword suggests rotation. To me, this means "common factor analysis." Is that correct?

One thing that often gets ignored about unrotated PCA is that it results in a mathematically unique solution, where the first factor has been called a "junk" factor by some academics insofar as everything loads on it. Rotation gives up that uniqueness by redistributing the loadings across the retained factors toward something called "simple structure." The goal of simple structure is for each variable to load on a single factor only and to have loadings of zero (or close to it) on the other factors. Given that, have you examined the first, unrotated PCA component for its value with respect to your objective?

Next, factor analysis results in a set of linear combinations that recover a reduced percentage of the total variance. A second, higher-order factor analysis would reduce the recovered variance even more.

Finally, if you want to get really geeky, check out the literature on additive and ultrametric trees for a good discussion of second-order factor analysis. This is not an area that's seen much recent research that I'm aware of but there's a Sage book with this title by James Corter that dates back 25 years or so.

In my opinion, leveraging the first PC would be a safe, easy solution.
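As a rough illustration of what "leveraging the first PC" involves, here is a minimal Python sketch that extracts the first unrotated principal component of a correlation matrix by power iteration and scores each case on it. The data are made up, `standardize` and `first_component` are hypothetical helper names, and the sketch assumes every column varies and that the largest eigenvalue is well separated from the second. A statistics package would normally do this step.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def first_component(data, iterations=500):
    """Return (eigenvalue, weights, scores) for the first PC of the correlation matrix."""
    n = len(data)
    cols = [standardize(list(c)) for c in zip(*data)]
    p = len(cols)
    # Correlation matrix of the standardized columns.
    R = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1) for j in range(p)]
         for i in range(p)]
    # Slightly uneven start vector, so it is unlikely to be orthogonal to the answer.
    w = [1.0 + 0.1 * i for i in range(p)]
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    for _ in range(iterations):
        v = [sum(R[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient w'Rw gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(w[i] * sum(R[i][j] * w[j] for j in range(p)) for i in range(p))
    scores = [sum(w[j] * cols[j][i] for j in range(p)) for i in range(n)]
    return eigenvalue, w, scores


# Made-up data: 5 cases measured on 3 positively related variables.
data = [
    [1, 1, 2],
    [1, 2, 1],
    [2, 2, 2],
    [3, 2, 3],
    [3, 3, 3],
]
eigenvalue, weights, scores = first_component(data)
print(f"first eigenvalue: {eigenvalue:.3f} ({100 * eigenvalue / len(weights):.1f}% of variance)")
print("weights:", [round(x, 3) for x in weights])
```

The resulting `scores` column is a single composite per case, which can then be examined on its own or used as one predictor.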

Solved – Calculating variance explained by factors after exploratory factor analysis with oblique rotation in R

I do not know what is usually reported in papers using oblique factor analysis. However, this is what I would do, as in this case at least I know exactly what I am reporting and this makes sense to me.

To compute the percentage of variance of an individual variable explained by a given factor, one can compute the squares of the structure loadings. If we sum these over all variables, we get the sum of the variances (SS loadings) of all variables explained by a given factor. This is also what is computed by SPSS. If we divide this by the sum of all variances of the variables (equal to the number of variables in the case of standardized variables, which is always the case when using a correlation matrix, as fa from psych does by default), we get the share/% of variance explained by individual factors. I think you can report that; just make sure that you do not sum these shares across factors, which does not make sense when the factors are correlated. In addition to that, I would report the % of variance explained by all factors together. This can be computed (when correlations/standardized variables are used) as the mean communality, and is actually the same as the total percentage printed by the print method for the object returned by fa.

Here is how to compute all this based on the efa object from opening question.

# Compute SS loadings (column sums of squared structure loadings)
SS <- colSums(efa$Structure^2)
# Compute percentage of explained variance by factor
SS / length(efa$communality)
# Total explained variability
mean(efa$communality)
# WRONG: cumulative percentages are not meaningful when factors are correlated
cumsum(SS / length(efa$communality))

Just a note. I think some things have changed in the psych package since my answer that the OP is citing, although using the mean communality is still OK.
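To see why the per-factor shares should not be added up, here is a small Python sketch of the same bookkeeping as `colSums(efa$Structure^2)` and `mean(efa$communality)`. It uses a made-up pattern matrix and factor correlation matrix, not output from any real fit. It assumes the usual relations for an oblique solution: the structure matrix is the pattern matrix times the factor correlation matrix, and each communality is the row-wise sum of pattern times structure loadings.

```python
def structure_matrix(pattern, phi):
    """Structure loadings S = P * Phi (correlations between variables and factors)."""
    k = len(phi)
    return [[sum(row[m] * phi[m][j] for m in range(k)) for j in range(k)]
            for row in pattern]


# Made-up oblique solution: 4 standardized variables, 2 factors correlated at 0.4.
pattern = [
    [0.8, 0.0],
    [0.7, 0.1],
    [0.0, 0.6],
    [0.1, 0.7],
]
phi = [
    [1.0, 0.4],
    [0.4, 1.0],
]

structure = structure_matrix(pattern, phi)
n_vars = len(pattern)

# SS loadings per factor: column sums of squared structure loadings.
ss_loadings = [sum(row[j] ** 2 for row in structure) for j in range(len(phi))]
share_by_factor = [ss / n_vars for ss in ss_loadings]

# Communality of each variable: sum over factors of pattern * structure loading.
communality = [sum(p * s for p, s in zip(p_row, s_row))
               for p_row, s_row in zip(pattern, structure)]
total_share = sum(communality) / n_vars

print("share by factor:", [round(s, 4) for s in share_by_factor])
print("sum of shares:  ", round(sum(share_by_factor), 4))
print("total explained:", round(total_share, 4))
```

With these numbers the per-factor shares add up to more than the total explained variance, because correlated factors partly account for the same variance.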
