Factor Analysis – How Does Factor Analysis Explain Covariance While PCA Explains Variance?

factor analysisgeometrypca

Here is a quote from Bishop's "Pattern Recognition and Machine Learning" book, section 12.2.4 "Factor analysis":

enter image description here

According to the highlighted part, factor analysis captures the covariance between variables in the matrix $W$. I wonder HOW?

Here is how I understand it. Say $x$ is the observed $p$-dimensional variable, $W$ is the factor loading matrix, and $z$ is the factor score vector. Then we have $$x=\mu+Wz+\epsilon,$$ that is
\begin{align*}
\begin{pmatrix}
x_1\\
\vdots\\
x_p
\end{pmatrix}
=
\begin{pmatrix}
\mu_1\\
\vdots\\
\mu_p
\end{pmatrix}
+
\begin{pmatrix}
\vert & & \vert\\
w_1 & \ldots & w_m\\
\vert & & \vert
\end{pmatrix}
\begin{pmatrix}
z_1\\
\vdots\\
z_m
\end{pmatrix}
+\epsilon,
\end{align*}
and each column in $W$ is a factor loading vector
$$w_i=\begin{pmatrix}w_{i1}\\ \vdots\\ w_{ip}\end{pmatrix}.$$
Here as I wrote, $W$ has $m$ columns meaning there are $m$ factors under consideration.

Now here is the point, according to the highlighted part, I think the loadings in each column $w_i$ explain the covariance in the observed data, right?

For example, let's take a look at the first loading vector $w_1$, for $1\le i,j,k\le p$, if $w_{1i}=10$, $w_{1j}=11$ and $w_{1k}=0.1$, then I'd say $x_i$ and $x_j$ are highly correlated, whereas $x_k$ seems uncorrelated with them, am I right?

And if this is how factor analysis explain the covariance between observed features, then I'd say PCA also explains the covariance, right?

Best Answer

The distinction between Principal component analysis and Factor analysis is discussed in numerous textbooks and articles on multivariate techniques. You may find the full thread, and a newer one, and odd answers, on this site, too.

I'm not going to make it detailed. I've already given a concise answer and a longer one and would like now to clarify it with a pair of pictures.

Graphical representation

The picture below explains PCA. (This was borrowed from here where PCA is compared with Linear regression and Canonical correlations. The picture is the vector representation of variables in the subject space; to understand what it is you may want to read the 2nd paragraph there.)

enter image description here

PCA configuration on this picture was described there. I will repeat most principal things. Principal components $P_1$ and $P_2$ lie in the same space that is spanned by the variables $X_1$ and $X_2$, "plane X". Squared length of each of the four vectors is its variance. The covariance between $X_1$ and $X_2$ is $cov_{12}= |X_1||X_2|r$, where $r$ equals the cosine of the angle between their vectors.

The projections (coordinates) of the variables on the components, the $a$'s, are the loadings of the components on the variables: loadings are the regression coefficients in the linear combinations of modeling variables by standardized components. "Standardized" - because information about components' variances is already absorbed in loadings (remember, loadings are eigenvectors normalized to the respective eigenvalues). And due to that, and to the fact that components are uncorrelated, loadings are the covariances between the variables and the components.

Using PCA for dimensionality/data reduction aim compels us to retain only $P_1$ and to regard $P_2$ as the remainder, or error. $a_{11}^2+a_{21}^2= |P_1|^2$ is the variance captured (explained) by $P_1$.


The picture below demonstrates Factor analysis performed on the same variables $X_1$ and $X_2$ with which we did PCA above. (I will speak of common factor model, for there exist other: alpha factor model, image factor model.) Smiley sun helps with lighting.

The common factor is $F$. It is what is the analogue to the main component $P_1$ above. Can you see the difference between these two? Yes, clearly: the factor does not lie in the variables' space "plane X".

How to get that factor with one finger, i.e. to do factor analysis? Let's try. On the previous picture, hook the end of $P_1$ arrow by your nail tip and pull away from "plane X", while visualizing how two new planes appear, "plane U1" and "plane U2"; these connecting the hooked vector and the two variable vectors. The two planes form a hood, X1 - F - X2, above "plane X".

enter image description here

Continue to pull while contemplating the hood and stop when "plane U1" and "plane U2" form 90 degrees between them. Ready, factor analysis is done. Well, yes, but not yet optimally. To do it right, like packages do, repeat the whole excercise of pulling the arrow, now adding small left-right swings of your finger while you pull. Doing so, find the position of the arrow when the sum of squared projections of both variables onto it is maximized, while you attain to that 90 degree angle. Stop. You did factor analysis, found the position of the common factor $F$.

Again to remark, unlike principal component $P_1$, factor $F$ does not belong to variables' space "plane X". It therefore is not a function of the variables (principal component is, and you can make sure from the two top pictures here that PCA is fundamentally two-directional: predicts variables by components and vice versa). Factor analysis is thus not a description/simplification method, like PCA, it is modeling method whereby latent factor steeres observed variables, one-directionally.

Loadings $a$'s of the factor on the variables are like loadings in PCA; they are the covariances and they are the coefficients of modeling variables by the (standardized) factor. $a_{1}^2+a_{2}^2= |F|^2$ is the variance captured (explained) by $F$. The factor was found as to maximize this quantity - as if a principal component. However, that explained variance is no more variables' gross variance, - instead, it is their variance by which they co-vary (correlate). Why so?

Get back to the pic. We extracted $F$ under two requirements. One was the just mentioned maximized sum of squared loadings. The other was the creation of the two perpendicular planes, "plane U1" containing $F$ and $X_1$, and "plane U2" containing $F$ and $X_2$. This way each of the X variables appeared decomposed. $X_1$ was decomposed into variables $F$ and $U_1$, mutually orthogonal; $X_2$ was likewise decomposed into variables $F$ and $U_2$, also orthogonal. And $U_1$ is orthogonal to $U_2$. We know what is $F$ - the common factor. $U$'s are called unique factors. Each variable has its unique factor. The meaning is as follows. $U_1$ behind $X_1$ and $U_2$ behind $X_2$ are the forces that hinder $X_1$ and $X_2$ to correlate. But $F$ - the common factor - is the force behind both $X_1$ and $X_2$ that makes them to correlate. And the variance being explained lie along that common factor. So, it is pure collinearity variance. It is that variance that makes $cov_{12}>0$; the actual value of $cov_{12}$ being determined by inclinations of the variables towards the factor, by $a$'s.

A variable's variance (vector's length squared) thus consists of two additive disjoint parts: uniqueness $u^2$ and communality $a^2$. With two variables, like our example, we can extract at most one common factor, so communality = single loading squared. With many variables we might extract several common factors, and a variable's communality will be the sum of its squared loadings. On our picture, the common factors space is unidimensional (just $F$ itself); when m common factors exist, that space is m-dimensional, with communalities being variables' projections on the space and loadings being variables' as well as those projections' projections on the factors that span the space. Variance explained in factor analysis is the variance within that common factors' space, different from variables' space in which components explain variance. The space of the variables is in the belly of the combined space: m common + p unique factors.

enter image description here

Just glance at the current pic please. There were several (say, $X_1$, $X_2$, $X_3$) variables with which factor analysis was done, extracting two common factors. The factors $F_1$ and $F_2$ span the common factor space "factor plane". Of the bunch of analysed variables only one ($X_1$) is shown on the figure. The analysis decomposed it in two orthogonal parts, communality $C_1$ and unique factor $U_1$. Communality lies in the "factor plane" and its coordinates on the factors are the loadings by which the common factors load $X_1$ (= coordinates of $X_1$ itself on the factors). On the picture, communalities of the other two variables - projections of $X_2$ and of $X_3$ - are also displayed. It would be interesting to remark that the two common factors can, in a sense, be seen as the principal components of all those communality "variables". Whereas usual principal components summarize by seniority the multivariate total variance of the variables, the factors summarize likewise their multivariate common variance. $^1$

Why needed all that verbiage? I just wanted to give evidence to the claim that when you decompose each of the correlated variables into two orthogonal latent parts, one (A) representing uncorrelatedness (orthogonality) between the variables and the other part (B) representing their correlatedness (collinearity), and you extract factors from the combined B's only, you will find yourself explaining pairwise covariances, by those factors' loadings. In our factor model, $cov_{12} \approx a_1a_2$ - factors restore individual covariances by means of loadings. In PCA model, it is not so since PCA explains undecomposed, mixed collinear+orthogonal native variance. Both strong components that you retain and subsequent ones that you drop are fusions of (A) and (B) parts; hence PCA can tap, by its loadings, covariances only blindly and grossly.


Contrast list PCA vs FA

  • PCA: operates in the space of the variables. FA: trancsends the space of the variables.
  • PCA: takes variability as is. FA: segments variability into common and unique parts.
  • PCA: explains nonsegmented variance, i.e. trace of the covariance matrix. FA: explains common variance only, hence explains (restores by loadings) correlations/covariances, off-diagonal elements of the matrix. (PCA explains off-diagonal elements too - but in passing, offhand manner - simply because variances are shared in a form of covariances.)
  • PCA: components are theoretically linear functions of variables, variables are theoretically linear functions of components. FA: variables are theoretically linear functions of factors, only.
  • PCA: empirical summarizing method; it retains m components. FA: theoretical modeling method; it fits fixed number m factors to the data; FA can be tested (Confirmatory FA).
  • PCA: is simplest metric MDS, aims to reduce dimensionality while indirectly preserving distances between data points as much as possible. FA: Factors are essential latent traits behind variables which make them to correlate; the analysis aims to reduce data to those essences only.
  • PCA: rotation/interpretation of components - sometimes (PCA is not enough realistic as a latent-traits model). FA: rotation/interpretation of factors - routinely.
  • PCA: data reduction method only. FA: also a method to find clusters of coherent variables (this is because variables cannot correlate beyond a factor).
  • PCA: loadings and scores are independent of the number m of components "extracted". FA: loadings and scores depend on the number m of factors "extracted".
  • PCA: component scores are exact component values. FA: factor scores are approximates to true factor values, and several computational methods exist. Factor scores do lie in the space of the variables (like components do) while true factors (as embodied by factor loadings) do not.
  • PCA: usually no assumptions. FA: assumption of weak partial correlations; sometimes multivariate normality assumption; some datasets may be "bad" for analysis unless transformed.
  • PCA: noniterative algorithm; always successful. FA: iterative algorithm (typically); sometimes nonconvergence problem; singularity may be a problem.

$^1$ For meticulous. One might ask where are variables $X_2$ and $X_3$ themselves on the pic, why were they not drawn? The answer is that we can't draw them, even theoretically. The space on the picture is 3d (defined by "factor plane" and the unique vector $U_1$; $X_1$ lying on their mutual complement, plane shaded grey, that's what corresponds to one slope of the "hood" on the picture No.2), and so our graphic resources are exhausted. The three dimensional space spanned by three variables $X_1$, $X_2$, $X_3$ together is another space. Neither "factor plane" nor $U_1$ are the subspaces of it. It's what is different from PCA: factors do not belong to the variables' space. Each variable separately lies in its separate grey plane orthogonal to "factor plane" - just like $X_1$ shown on our pic, and that is all: if we were to add, say, $X_2$ to the plot we should have invented 4th dimension. (Just recall that all $U$s have to be mutually orthogonal; so, to add another $U$, you must expand dimensionality farther.)

Similarly as in regression the coefficients are the coordinates, on the predictors, both of the dependent variable(s) and of the prediction(s) (See pic under "Multiple Regression", and here, too), in FA loadings are the coordinates, on the factors, both of the observed variables and of their latent parts - the communalities. And exactly as in regression that fact did not make the dependent(s) and the predictors be subspaces of each other, - in FA the similar fact does not make the observed variables and the latent factors be subspaces of each other. A factor is "alien" to a variable in a quite similar sense as a predictor is "alien" to a dependent response. But in PCA, it is other way: principal components are derived from the observed variables and are confined to their space.

So, once again to repeat: m common factors of FA are not a subspace of the p input variables. On the contrary: the variables form a subspace in the m+p (m common factors + p unique factors) union hyperspace. When seen from this perspective (i.e. with the unique factors attracted too) it becomes clear that classic FA is not a dimensionality shrinkage technique, like classic PCA, but is a dimensionality expansion technique. Nevertheless, we give our attention only to a small (m dimensional common) part of that bloat, since this part solely explains correlations.

Related Question