Solved – PCA and exploratory Factor Analysis on the same dataset: differences and similarities; factor model vs PCA

Tags: factor-analysis, pca

I would like to know if it makes any logical sense to perform principal component analysis (PCA) and exploratory factor analysis (EFA) on the same data set. I have heard professionals expressly recommend:

  1. Understand what the goal of the analysis is and choose either PCA or EFA accordingly;
  2. Having done one analysis, there is no need to do the other.

I understand the motivational differences between the two, but I was just wondering whether there is anything wrong with interpreting the results of PCA and EFA at the same time?

Best Answer

Both models - the principal-component model and the common-factor model - are similar, straightforward linear regression models predicting observed variables by latent variables. Let us have centered variables $V_1, V_2, \dots, V_p$, and suppose we choose to extract 2 components/factors, $F_I$ and $F_{II}$. Then the model is the system of equations:

$V_1 = a_{1I}F_I + a_{1II}F_{II} + E_1$

$V_2 = a_{2I}F_I + a_{2II}F_{II} + E_2$

$\vdots$

$V_p = a_{pI}F_I + a_{pII}F_{II} + E_p$

where the coefficients $a$ are loadings, the $F$ are factors or components, and the variables $E$ are regression residuals. Here the FA model differs from the PCA model in exactly one respect: FA imposes the requirement that the variables $E_1, E_2, \dots, E_p$ (the error terms, which are uncorrelated with the $F$s) must not correlate with each other. FA calls these error variables "unique factors"; their variances are known (the "uniquenesses"), but their casewise values are not. Therefore factor scores $F$ are computed only as good approximations; they are not exact.

(A matrix algebra presentation of this common factor analysis model is in Footnote $^1$.)
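To make these quantities concrete, here is a minimal sketch (my own addition, not part of the original answer), assuming Python with scikit-learn's `FactorAnalysis` (maximum-likelihood extraction) and using the iris data merely as a stand-in for $V_1 \dots V_p$; the attribute names are scikit-learn's, not notation from the answer:

```python
# Fit a 2-factor model and inspect loadings, uniquenesses, and (approximate) factor scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

V = load_iris().data                 # n x p data matrix (stand-in for V1 ... Vp)
V = V - V.mean(axis=0)               # centered variables

fa = FactorAnalysis(n_components=2, random_state=0).fit(V)

A = fa.components_.T                 # p x 2 loading matrix (the a_ij above)
u2 = fa.noise_variance_              # uniquenesses: variances of the unique factors E1 ... Ep
F_hat = fa.transform(V)              # factor scores: estimates of F, not exact values

print(A.round(2))
print(u2.round(2))
print(F_hat[:3].round(2))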

In PCA, by contrast, the error variables from predicting different variables may freely correlate: nothing is imposed on them. They represent the "dross" that the left-out $p-2$ dimensions account for. We know the values of $E$, and so we can compute the component scores $F$ as exact values.

That is the difference between the PCA model and the FA model.
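A quick numerical illustration of this difference (my own sketch, assuming numpy/scikit-learn, with iris again standing in for the variables):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

V = load_iris().data
V = V - V.mean(axis=0)

pca = PCA(n_components=2).fit(V)
F = pca.transform(V)                          # component scores: exact, not estimated
E = V - F @ pca.components_                   # residuals of the 2-component model

# With all p components the reconstruction is exact (E would vanish):
full = PCA().fit(V)
assert np.allclose(full.inverse_transform(full.transform(V)), V)

# The residuals E of the truncated PCA model are free to correlate:
print(np.corrcoef(E, rowvar=False).round(2))  # off-diagonals generally nonzero
```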

It is due to this difference that FA is able to explain pairwise correlations (covariances). PCA generally cannot do it (unless the number of extracted components $= p$); it can only explain the multivariate variance$^2$. So, as long as the term "factor analysis" is defined via the aim to explain correlations, PCA is not factor analysis. If "factor analysis" is defined more broadly, as a method providing or suggesting latent "traits" which could be interpreted, PCA can be seen as a special and simplest form of factor analysis.
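A hedged check of this claim (again my own sketch, not the answer's computation): reproduce the correlation matrix from FA loadings and from PCA loadings with the same small $m$, and compare the off-diagonal errors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

Z = scale(load_iris().data)            # standardized variables, so Sigma is the correlation matrix
R = np.corrcoef(Z, rowvar=False)
m = 2

# FA loadings (on standardized data) and their reproduced correlations A A'
A_fa = FactorAnalysis(n_components=m).fit(Z).components_.T
R_fa = A_fa @ A_fa.T

# PCA "loadings": eigenvectors of R scaled by the square roots of the eigenvalues
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
A_pc = vecs[:, order[:m]] * np.sqrt(vals[order[:m]])
R_pc = A_pc @ A_pc.T

off = ~np.eye(len(R), dtype=bool)
print(np.abs(R - R_fa)[off].max())     # max off-diagonal reproduction error, FA loadings
print(np.abs(R - R_pc)[off].max())     # max off-diagonal reproduction error, PCA loadings
```

On typical data the FA error should come out much smaller, which is the sense in which FA "explains" the correlations while PCA with $m<p$ does not.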

Sometimes - in some datasets, under certain conditions - PCA leaves $E$ terms which almost do not intercorrelate. Then PCA can explain the correlations and becomes like FA. This is not very uncommon with datasets containing many variables, and it has led some observers to claim that PCA results approach FA results as the data grow. I don't think it is a rule, but the tendency may indeed exist. In any case, given their theoretical differences, it is always good to select the method consciously. FA is the more realistic model if you want to reduce the variables to latents which you are going to regard as real latent traits standing behind the variables and making them correlate.

But if you have another aim - to reduce dimensionality while preserving the distances between the points of the data cloud as much as possible - PCA is better than FA. (However, an iterative multidimensional scaling (MDS) procedure will be even better then; PCA amounts to noniterative metric MDS.) If you further don't care much about the distances and are interested only in preserving as much of the overall variance of the data as possible with a few dimensions, PCA is an optimal choice.


$^1$ Factor analysis data model: $\mathbf{V} = \mathbf{F}\mathbf{A}' + \mathbf{E}\,\mathrm{diag}(\mathbf{u})$, where $\bf V$ is the $n$ cases $\times$ $p$ variables analyzed data (columns centered or standardized), $\bf F$ is the $n \times m$ matrix of common factor values (the unknown true values, not factor scores) with unit variances, $\bf A$ is the $p \times m$ matrix of common factor loadings (pattern matrix), $\bf E$ is the $n \times p$ matrix of unique factor values (unknown), and $\bf u$ is the $p$-vector of unique factor loadings, equal to the square roots of the uniquenesses ($\bf u^2$). The portion $\mathbf{E}\,\mathrm{diag}(\mathbf{u})$ could simply be labeled "$E$", as it is in the formulas opening the answer.

Principal assumptions of the model:

  • $\bf F$ and $\bf E$ variables (common and unique factors, respectively) have zero means and unit variances; $\bf E$ is typically assumed multivariate normal, but $\bf F$ in the general case need not be multivariate normal (if both are assumed multivariate normal, then $\bf V$ is, too);
  • $\bf E$ variables are uncorrelated with each other and are uncorrelated with $\bf F$ variables.
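For illustration only, here is a small simulation of this model under the assumptions just listed (my own sketch; the dimensions and parameter values are arbitrary choices):

```python
# Generate data from V = F A' + E diag(u) and check that the covariance of V
# comes out as A A' + diag(u^2), up to sampling error.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100_000, 6, 2

A = rng.uniform(-0.8, 0.8, size=(p, m))     # common factor loadings (pattern matrix)
u = rng.uniform(0.3, 0.7, size=p)           # unique factor loadings; u^2 are the uniquenesses

F = rng.standard_normal((n, m))             # common factors: zero mean, unit variance
E = rng.standard_normal((n, p))             # unique factors: uncorrelated with each other and with F

V = F @ A.T + E * u                         # V = F A' + E diag(u)

Sigma_model = A @ A.T + np.diag(u**2)
Sigma_sample = np.cov(V, rowvar=False)
print(np.abs(Sigma_sample - Sigma_model).max())   # small: sampling error only
```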

$^2$ It follows from the common factor analysis model that the loadings $\bf A$ of $m$ common factors ($m < p$ variables), also denoted $\bf A_{(m)}$, should closely reproduce the observed covariances (or correlations) between the variables, $\bf \Sigma$. If the factors are orthogonal, the fundamental factor theorem states that

$\bf \hat{\Sigma} = AA'$ and $\bf \Sigma \approx \hat{\Sigma} + \mathrm{diag}(\bf u^2)$,

where $\bf \hat{\Sigma}$ is the matrix of reproduced covariances (or correlations) with common variances ("communalities") on its diagonal, and the unique variances ("uniquenesses") - the variances minus the communalities - form the vector $\bf u^2$. The off-diagonal discrepancy ($\approx$) arises because the factor model is a theoretical model generating the data, and as such it is simpler than the observed data it was built on. The main causes of the discrepancy between the observed and the reproduced covariances (or correlations) may be: (1) the number of factors $m$ is not statistically optimal; (2) partial correlations (these are $p(p-1)/2$ correlations that do not belong to the common factors) are pronounced; (3) the communalities are not well assessed, their initial values having been poor; (4) relationships are not linear, so using a linear model is questionable; (5) the model "subtype" produced by the extraction method is not optimal for the data (see the discussion of different extraction methods). In other words, some FA data assumptions are not fully met.

As for plain PCA, it reproduces the covariances by the loadings exactly when $m=p$ (all components are used) and usually fails to do so when $m<p$ (only the first few components are retained). The factor theorem for PCA is:

$\bf \Sigma = A_{(p)}A'_{(p)} = A_{(m)}A'_{(m)} + A_{(p-m)}A'_{(p-m)}$,

so both the retained loadings $\bf A_{(m)}$ and the dropped loadings $\bf A_{(p-m)}$ are mixtures of communalities and uniquenesses, and neither can restore the covariances on its own. The closer $m$ is to $p$, the better PCA restores the covariances, as a rule, but a small $m$ (which is often what interests us) doesn't help. This is different from FA, which is intended to restore the covariances with a quite small, optimal number of factors. If $\bf A_{(p-m)}A'_{(p-m)}$ approaches diagonality, PCA becomes like FA, with $\bf A_{(m)}$ restoring all the covariances. This happens occasionally with PCA, as I've already mentioned. But PCA lacks the algorithmic ability to force such diagonalization; it is the FA algorithms that do it.
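A quick numerical check of this decomposition (my own sketch, with numpy and the iris data as arbitrary example input):

```python
import numpy as np
from sklearn.datasets import load_iris

V = load_iris().data
Sigma = np.cov(V, rowvar=False)

vals, vecs = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]
A_full = vecs[:, order] * np.sqrt(vals[order])   # p x p PCA loading matrix A_(p)

m = 2
A_m, A_rest = A_full[:, :m], A_full[:, m:]

assert np.allclose(Sigma, A_full @ A_full.T)                  # exact when m = p
assert np.allclose(Sigma, A_m @ A_m.T + A_rest @ A_rest.T)    # the split stated above
print(np.abs(Sigma - A_m @ A_m.T).max())                      # nonzero: m < p alone falls short
```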

FA, not PCA, is a data-generative model: it presumes a few "true" common factors (of usually unknown number, so you try out $m$ within a range) which generate the "true" values of the covariances. The observed covariances are then the "true" ones plus small random noise. (It is the diagonalization just described, which leaves $\bf A_{(m)}$ the sole restorer of all the covariances, that allows this noise to be small and random.) Trying to fit more factors than optimal amounts to an attempt at overfitting, and not necessarily an efficient one.

Both FA and PCA aim to maximize $\mathrm{trace}(\bf A'_{(m)}A_{(m)})$, but for PCA it is the only goal, whereas for FA it is a concomitant goal, the main one being to diagonalize away the uniquenesses. That trace is the sum of the eigenvalues in PCA. Some extraction methods in FA add further concomitant goals at the expense of maximizing the trace, so in FA it is not of principal importance.

To summarize the differences between the two methods: FA aims (directly or indirectly) at minimizing the differences between the individual corresponding off-diagonal elements of $\bf \Sigma$ and $\bf AA'$. A successful FA model is one that leaves the errors for the covariances small and random-like (normal or uniform about 0, no outliers/fat tails). PCA only maximizes $\mathrm{trace}(\bf AA')$, which is equal to $\mathrm{trace}(\bf A'A)$ (and $\bf A'A$ equals the covariance matrix of the principal components, which is a diagonal matrix). Thus PCA is not "busy" with all the individual covariances: it simply cannot be, being merely a form of orthogonal rotation of the data.
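A small check of these two statements about PCA (my own sketch; the dataset is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

V = load_iris().data
m = 2

pca = PCA(n_components=m).fit(V)
A = (pca.components_ * np.sqrt(pca.explained_variance_)[:, None]).T   # p x m loadings
scores = pca.transform(V)

AtA = A.T @ A
assert np.allclose(np.trace(AtA), pca.explained_variance_.sum())      # trace = sum of retained eigenvalues
assert np.allclose(AtA, np.cov(scores, rowvar=False))                 # A'A = diagonal covariance of the PCs
print(AtA.round(3))
```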

Thanks to maximizing the trace - the variance explained by the $m$ components - PCA does account for covariances, since covariance is shared variance. In this sense PCA is a "low-rank approximation" of the whole covariance matrix of the variables. And when seen from the viewpoint of the observations, this approximation is an approximation of the Euclidean-distance matrix of the observations (which is why PCA, as metric MDS, is called "Principal coordinate analysis"). This fact should not screen us from the reality that PCA does not model the covariance matrix (each covariance) as generated by a few living latent traits imaginable as transcendent to our variables; the PCA approximation remains immanent, even when it is good: it is a simplification of the data.
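To illustrate the "Principal coordinate analysis" remark, here is a sketch (mine, not the answer's) showing that classical (Torgerson) scaling of the Euclidean distance matrix reproduces the PCA component scores up to column signs:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

V = load_iris().data
n, m = len(V), 2

# Classical MDS / PCoA: double-center the squared distance matrix, then eigendecompose
D2 = squareform(pdist(V)) ** 2
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:m]
coords = vecs[:, order] * np.sqrt(vals[order])

scores = PCA(n_components=m).fit_transform(V)
print(np.allclose(np.abs(coords), np.abs(scores)))   # True: same configuration up to sign
```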


If you want to see step-by-step computations done in PCA and FA, commented and compared, please look here.