Any good reason to use PCA instead of EFA? Also, can PCA be a substitute for factor analysis?

Tags: exploratory-data-analysis, factor-analysis, pca

In some disciplines, PCA (principal component analysis) is systematically used without any justification, and PCA and EFA (exploratory factor analysis) are considered as synonyms.

I therefore recently used PCA to analyse the results of a scale validation study (21 items on a 7-point Likert scale, assumed to compose 3 factors of 7 items each), and a reviewer asked me why I chose PCA instead of EFA. I read about the differences between the two techniques, and it seems that EFA is favored over PCA in a majority of your answers here.

Do you have any good reasons why PCA would be a better choice? What benefits could it provide, and why might it be a wise choice in my case?

Best Answer

Disclaimer: @ttnphns is very knowledgeable about both PCA and FA, and I respect his opinion and have learned a lot from many of his great answers on the topic. However, I tend to disagree with his reply here, as well as with other (numerous) posts on this topic here on CV, not only his; or rather, I think they have limited applicability.


I think that the difference between PCA and FA is overrated.

Look at it this way: both methods attempt to provide a low-rank approximation of a given covariance (or correlation) matrix. "Low-rank" means that only a limited (low) number of latent factors or principal components is used. If the $n \times n$ covariance matrix of the data is $\mathbf C$, then the models are:

\begin{align} \mathrm{PCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top \\ \mathrm{PPCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \sigma^2 \mathbf I \\ \mathrm{FA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \boldsymbol \Psi \end{align}

Here $\mathbf W$ is a matrix with $k$ columns (where $k$ is usually chosen to be a small number, $k<n$), representing $k$ principal components or factors, $\mathbf I$ is an identity matrix, and $\boldsymbol \Psi$ is a diagonal matrix. Each method can be formulated as finding $\mathbf W$ (and the rest) minimizing the [norm of the] difference between left-hand and right-hand sides.

PPCA stands for probabilistic PCA, and if you don't know what that is, it does not matter so much for now. I wanted to mention it, because it neatly fits between PCA and FA, having intermediate model complexity. It also puts the allegedly large difference between PCA and FA into perspective: even though it is a probabilistic model (exactly like FA), it actually turns out to be almost equivalent to PCA ($\mathbf W$ spans the same subspace).

Most importantly, note that the models only differ in how they treat the diagonal of $\mathbf C$. As the dimensionality $n$ increases, the diagonal becomes, in a way, less and less important (because there are only $n$ elements on the diagonal and $n(n-1)/2 = \mathcal O (n^2)$ elements off the diagonal). As a result, for large $n$ there is usually not much of a difference between PCA and FA at all, an observation that is rarely appreciated. For small $n$ they can indeed differ a lot.

Now to answer your main question as to why people in some disciplines seem to prefer PCA. I guess it boils down to the fact that it is mathematically a lot easier than FA (this is not obvious from the above formulas, so you have to believe me here):

  1. PCA -- as well as PPCA, which is only slightly different -- has an analytic solution, whereas FA does not. So FA needs to be fit numerically; there exist various algorithms for doing it, giving possibly different answers and operating under different assumptions, etc. In some cases some algorithms can get stuck (see e.g. "Heywood cases"). For PCA you perform an eigen-decomposition and you are done; FA is a lot messier.

    Technically, PCA simply rotates the variables, and that is why one can refer to it as a mere transformation, as @NickCox did in his comment above.

  2. The PCA solution does not depend on $k$: you can find the first three PCs ($k=3$) and the first two of those will be identical to the ones you would find if you had initially set $k=2$. That is not true for FA: the solution for $k=2$ is not necessarily contained inside the solution for $k=3$, which is counter-intuitive and confusing (see the sketch after this list).
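
Here is a minimal sketch of that nesting property, assuming numpy and scikit-learn; the toy data and the use of sklearn's PCA / FactorAnalysis are my own illustration, not part of the original argument:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.RandomState(0)
# toy data: 500 observations of 10 correlated variables (purely illustrative)
X = rng.randn(500, 3) @ rng.randn(3, 10) + 0.5 * rng.randn(500, 10)

# PCA: the first two components of a k=3 fit coincide with a k=2 fit
pca2 = PCA(n_components=2, svd_solver="full").fit(X)
pca3 = PCA(n_components=3, svd_solver="full").fit(X)
print(np.allclose(pca2.components_, pca3.components_[:2]))  # expected True: PCs are nested

# FA: the k=2 solution is generally not contained in the k=3 solution
fa2 = FactorAnalysis(n_components=2, random_state=0).fit(X)
fa3 = FactorAnalysis(n_components=3, random_state=0).fit(X)
print(np.allclose(fa2.components_, fa3.components_[:2]))    # typically False: FA solutions are not nested
```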

Of course FA is a more flexible model than PCA (after all, it has more parameters) and can often be more useful. I am not arguing against that. What I am arguing against is the claim that they are conceptually very different, with PCA being about "describing the data" and FA being about "finding latent variables". I just do not see this as true [almost] at all.

To comment on some specific points mentioned above and in the linked answers:

  • "in PCA the number of dimensions to extract/retain is fundamentally subjective, while in EFA the number is fixed, and you usually have to check several solutions" -- well, the choice of the solution is still subjective, so I don't see any conceptual difference here. In both cases, $k$ is (subjectively or objectively) chosen to optimize the trade-off between model fit and model complexity.

  • "FA is able to explain pairwise correlations (covariances). PCA generally cannot do it" -- not really, both of them explain correlations better and better as $k$ grows.

  • Sometimes extra confusion arises (but not in @ttnphns's answers!) due to the different practices in the disciplines using PCA and FA. For example, it is a common practice to rotate factors in FA to improve interpretability. This is rarely done after PCA, but in principle nothing is preventing it. So people often tend to think that FA gives you something "interpretable" and PCA does not, but this is often an illusion.

Finally, let me stress again that for very small $n$ the differences between PCA and FA can indeed be large, and maybe some of the claims in favour of FA are made with small $n$ in mind. As an extreme example, for $n=2$ a single factor can always perfectly explain the correlation, but one PC can fail at it quite badly (see the check below).
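
A quick numeric check of this $n=2$ case, assuming numpy; the correlation value $r=0.3$ is an arbitrary illustration:

```python
import numpy as np

r = 0.3
C = np.array([[1.0, r],
              [r, 1.0]])

# one principal component: W = sqrt(largest eigenvalue) * leading eigenvector
evals, evecs = np.linalg.eigh(C)
w_pca = np.sqrt(evals[-1]) * evecs[:, -1]
print(np.outer(w_pca, w_pca)[0, 1])   # (1 + r) / 2 = 0.65, overshoots the true correlation

# one factor: loadings sqrt(r) on both variables, uniquenesses 1 - r on the diagonal
w_fa = np.array([np.sqrt(r), np.sqrt(r)])
print(np.outer(w_fa, w_fa)[0, 1])     # exactly r = 0.3, off-diagonal reproduced perfectly
```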


Update 1: generative models of the data

You can see from the number of comments that what I am saying is taken to be controversial. At the risk of flooding the comment section even further, here are some remarks regarding "models" (see comments by @ttnphns and @gung). @ttnphns does not like that I used the word "model" [of the covariance matrix] to refer to the approximations above; it is an issue of terminology, but what he calls "models" are probabilistic/generative models of the data:

\begin{align} \mathrm{PPCA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \sigma^2 \mathbf I) \\ \mathrm{FA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \boldsymbol \Psi) \end{align}

Note that PCA is not a probabilistic model, and cannot be formulated in this way.

The difference between PPCA and FA is in the noise term: PPCA assumes the same noise variance $\sigma^2$ for each variable, whereas FA assumes different variances $\Psi_{ii}$ ("uniquenesses"). This minor difference has important consequences. Both models can be fit with a general expectation-maximization algorithm. For FA no analytic solution is known, but for PPCA one can analytically derive the solution that EM will converge to (both $\sigma^2$ and $\mathbf W$). It turns out that $\mathbf W_\mathrm{PPCA}$ has columns in the same directions but with smaller lengths than the standard PCA loadings $\mathbf W_\mathrm{PCA}$ (I omit the exact formulas). For that reason I think of PPCA as "almost" PCA: $\mathbf W$ in both cases spans the same "principal subspace".

The proof (Tipping and Bishop 1999) is a bit technical; the intuitive reason why homogeneous noise variance leads to a much simpler solution is that $\mathbf C - \sigma^2 \mathbf I$ has the same eigenvectors as $\mathbf C$ for any value of $\sigma^2$, but this is not true for $\mathbf C - \boldsymbol \Psi$.
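
This intuition is easy to verify numerically; here is a minimal sketch, assuming numpy and an arbitrary toy covariance matrix of my own choosing:

```python
import numpy as np

rng = np.random.RandomState(1)
A = rng.randn(6, 6)
C = A @ A.T                      # an arbitrary 6x6 covariance matrix

sigma2 = 0.5
evec_C = np.linalg.eigh(C)[1]
evec_s = np.linalg.eigh(C - sigma2 * np.eye(6))[1]
print(np.allclose(np.abs(evec_C), np.abs(evec_s)))   # True: subtracting sigma^2*I keeps the eigenvectors

Psi = np.diag(rng.rand(6))       # heterogeneous "uniquenesses" on the diagonal
evec_p = np.linalg.eigh(C - Psi)[1]
print(np.allclose(np.abs(evec_C), np.abs(evec_p)))   # generally False: subtracting Psi changes them
```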

So yes, @gung and @ttnphns are right in that FA is based on a generative model and PCA is not, but I think it is important to add that PPCA is also based on a generative model, but is "almost" equivalent to PCA. Then it ceases to seem such an important difference.


Update 2: how come PCA provides the best approximation to the covariance matrix, when it is well known to be looking for maximal variance?

PCA has two equivalent formulations: e.g. the first PC is (a) the one maximizing the variance of the projection and (b) the one providing the minimal reconstruction error. More abstractly, the equivalence between maximizing variance and minimizing reconstruction error can be seen via the Eckart-Young theorem.

If $\mathbf X$ is the data matrix (with observations as rows, variables as columns, and columns assumed to be centered) and its SVD is $\mathbf X=\mathbf U\mathbf S\mathbf V^\top$, then it is well known that the columns of $\mathbf V$ are eigenvectors of the scatter matrix (or covariance matrix, if divided by the number of observations) $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$, and so they are the axes maximizing the variance (i.e. principal axes). By the Eckart-Young theorem, the first $k$ PCs provide the best rank-$k$ approximation to $\mathbf X$: $\mathbf X_k=\mathbf U_k\mathbf S_k \mathbf V^\top_k$ (this notation means taking only the $k$ largest singular values/vectors) minimizes $\|\mathbf X-\mathbf X_k\|^2$.

The first $k$ PCs provide not only the best rank-$k$ approximation to $\mathbf X$, but also to the covariance matrix $\mathbf C$. Indeed, $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$, and the last expression is an SVD of $\mathbf C$ (because $\mathbf V$ is orthogonal and $\mathbf S^2$ is diagonal). So the Eckart-Young theorem tells us that the best rank-$k$ approximation to $\mathbf C$ is given by $\mathbf C_k = \mathbf V_k\mathbf S_k^2\mathbf V_k^\top$. This can be transformed by noticing that $\mathbf W = \mathbf V\mathbf S$ are the PCA loadings, and so $$\mathbf C_k=\mathbf V_k\mathbf S_k^2\mathbf V^\top_k=(\mathbf V\mathbf S)_k(\mathbf V\mathbf S)_k^\top=\mathbf W_k\mathbf W^\top_k.$$
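
Here is a quick numerical check of this identity, assuming numpy and some arbitrary toy data (not part of the original derivation):

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(100, 8)
X -= X.mean(axis=0)              # center the columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
C = X.T @ X                      # scatter matrix (covariance up to a 1/n factor)

k = 3
W = Vt.T * s                     # loadings W = V S (each column of V scaled by its singular value)
C_k = W[:, :k] @ W[:, :k].T      # rank-k reconstruction from the first k loadings

# best rank-k approximation of C from its own eigendecomposition (top-k eigenpairs)
evals, evecs = np.linalg.eigh(C)
C_best = evecs[:, -k:] @ np.diag(evals[-k:]) @ evecs[:, -k:].T
print(np.allclose(C_k, C_best))  # True: W_k W_k^T is the best rank-k approximation to C
```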

The bottom-line here is that $$ \mathrm{minimizing} \; \left\{\begin{array}{ll} \|\mathbf C-\mathbf W\mathbf W^\top\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\sigma^2\mathbf I\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\boldsymbol\Psi\|^2\end{array}\right\} \; \mathrm{leads \: to} \; \left\{\begin{array}{cc} \mathrm{PCA}\\ \mathrm{PPCA} \\ \mathrm{FA} \end{array}\right\} \; \mathrm{loadings},$$ as stated in the beginning.


Update 3: numerical demonstration that PCA$\to$FA when $n \to \infty$

I was encouraged by @ttnphns to provide a numerical demonstration of my claim that as dimensionality grows, PCA solution approaches FA solution. Here it goes.

I generated a $200\times 200$ random correlation matrix with some strong off-diagonal correlations. I then took the upper-left $n \times n$ square block $\mathbf C$ of this matrix with $n=25, 50, \dots 200$ variables to investigate the effect of the dimensionality. For each $n$, I performed PCA and FA with number of components/factors $k=1\dots 5$, and for each $k$ I computed the off-diagonal reconstruction error $$\sum_{i\ne j}\left[\mathbf C - \mathbf W \mathbf W^\top\right]^2_{ij}$$ (note that on the diagonal, FA reconstructs $\mathbf C$ perfectly, due to the $\boldsymbol \Psi$ term, whereas PCA does not; but the diagonal is ignored here). Then for each $n$ and $k$, I computed the ratio of the PCA off-diagonal error to the FA off-diagonal error. This ratio has to be above $1$, because FA provides the best possible reconstruction.
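
For readers who want to play with this, here is a rough sketch of that kind of simulation in Python, assuming numpy and scikit-learn's FactorAnalysis as the FA fitter; the toy correlation structure below is my own and not the one used for the figure:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def offdiag_error(C, W):
    """Sum of squared off-diagonal residuals of C - W W^T."""
    R = C - W @ W.T
    return np.sum(R**2) - np.sum(np.diag(R)**2)

rng = np.random.RandomState(0)
n_max, k = 200, 3
# data with a strong low-rank structure plus noise, giving correlated variables
X = rng.randn(1000, 10) @ rng.randn(10, n_max) + rng.randn(1000, n_max)
X = (X - X.mean(0)) / X.std(0)                 # standardize, so we work with correlation matrices

for n in (25, 50, 100, 200):
    C = np.corrcoef(X[:, :n].T)
    # PCA loadings: top-k eigenvectors scaled by the square roots of their eigenvalues
    evals, evecs = np.linalg.eigh(C)
    W_pca = evecs[:, -k:] * np.sqrt(evals[-k:])
    # FA loadings via sklearn (fit on the standardized data, so it models the correlation matrix)
    W_fa = FactorAnalysis(n_components=k, random_state=0).fit(X[:, :n]).components_.T
    # ratio of off-diagonal reconstruction errors; it should shrink toward 1 as n grows
    print(n, offdiag_error(C, W_pca) / offdiag_error(C, W_fa))
```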

[Figure: PCA vs FA off-diagonal reconstruction error]

In this figure, different lines correspond to different values of $k$, and $n$ is shown on the horizontal axis. Note that as $n$ grows, the ratios (for all $k$) approach $1$, meaning that PCA and FA yield approximately the same loadings, PCA$\approx$FA. With relatively small $n$, e.g. when $n=25$, PCA performs [expectedly] worse, but the difference is not that strong for small $k$, and even for $k=5$ the ratio is below $1.2$.

The ratio can become large when the number of factors $k$ becomes comparable with the number of variables $n$. In the example I gave above with $n=2$ and $k=1$, FA achieves $0$ reconstruction error, whereas PCA does not, i.e. the ratio would be infinite. But getting back to the original question, when $n=21$ and $k=3$, PCA will only moderately lose to FA in explaining the off-diagonal part of $\mathbf C$.

For an illustrated example of PCA and FA applied to a real dataset (wine dataset with $n=13$), see my answers here: