Solved – PCA and FA example – calculation of communalities

Tags: factor-analysis, pca

I'm trying to understand how Principal Component Analysis and Factor Analysis work by
implementing examples. Although I'm mainly using Python and NumPy here, this isn't Python-specific; I'd like to know how to get the correct result in general.

There is an example in chapter 16 of Statistics in a Nutshell, but
I can't understand how some of the resulting values are obtained. I'm not sure whether (a) there's something wrong with my implementation/calculations or (b) I'm misunderstanding the "jump" from PCA to FA.
(The examples are on pages 300–303. It wouldn't be appropriate to reproduce material from the book, but I hope it's reasonable to quote its short numerical example.)

The example table (psychometric test results) gives the data matrix below:

import numpy as np

# (Table 16-1 in the book)
# Columns are:
# Reading, Music, Arith., Verbal, Sports, Spelling, Geometry
data = np.array(
[[  8,  9,  6,  8,  5,  9, 10],
 [  5,  6,  5,  5,  6,  5,  5],
 [  2,  3,  2,  6,  8,  6,  4],
 [  8,  9, 10,  9,  8, 10,  6],
 [ 10,  7,  1, 10,  5, 10,  2],
 [  9,  8,  4,  9,  1,  7,  2],
 [  3,  9, 10,  2,  6,  4,  9],
 [  8, 10,  3,  8,  5,  7,  2],
 [ 10,  9,  3, 10,  6, 10,  3],
 [  7, 10,  1,  9,  6, 10,  2]])

# Transpose so that rows are variables and columns are observations,
# which is what np.corrcoef expects by default
data = data.T

corrmat = np.corrcoef(data)

The correlation matrix calculated this way with NumPy matches the one in the book (Table 16-2, for those who have it).

Here is the same input data for R:

data <- matrix(c( 8,  9,  6,  8,  5,  9, 10,
                  5,  6,  5,  5,  6,  5,  5,
                  2,  3,  2,  6,  8,  6,  4,
                  8,  9, 10,  9,  8, 10,  6,
                 10,  7,  1, 10,  5, 10,  2,
                  9,  8,  4,  9,  1,  7,  2,
                  3,  9, 10,  2,  6,  4,  9,
                  8, 10,  3,  8,  5,  7,  2,
                 10,  9,  3, 10,  6, 10,  3,
                  7, 10,  1,  9,  6, 10,  2), ncol = 7, byrow = TRUE)
pc <- prcomp(data, scale. = TRUE)

NumPy's output for the correlation matrix is this:

  [[ 1.        ,  0.53484056, -0.25290194,  0.86021546, -0.46870501,
     0.76225482, -0.38632342],
   [ 0.53484056,  1.        ,  0.24875809,  0.26248971, -0.26308503,
     0.38020761,  0.06879192],
   [-0.25290194,  0.24875809,  1.        , -0.50102461,  0.20615027,
    -0.30668865,  0.75803231],
   [ 0.86021546,  0.26248971, -0.50102461,  1.        , -0.23649405,
     0.89522479, -0.56880086],
   [-0.46870501, -0.26308503,  0.20615027, -0.23649405,  1.        ,
     0.05436758,  0.26604241],
   [ 0.76225482,  0.38020761, -0.30668865,  0.89522479,  0.05436758,
     1.        , -0.29078439],
   [-0.38632342,  0.06879192,  0.75803231, -0.56880086,  0.26604241,
    -0.29078439,  1.        ]]

I'm using the following to perform the PCA; the (transposed) data matrix is projected onto the eigenvectors to give the pca_output matrix.

The cumulative % of variance values also match the book's example (Table 16-4).

eigenvalues, eigenvectors = np.linalg.eig(corrmat)

# Order the eigenvalues by decreasing value
# (and then order eigenvectors).
evals_order = np.argsort(-eigenvalues)
eigenvalues = eigenvalues[evals_order]
eigenvectors = eigenvectors[:, evals_order]

# pca_output: columns are the principal
# components, after transposition (used here)
pca_output = np.dot(eigenvectors.T, data).T

# Cumulative percentage of variance explained by the first i components
cumulative_perc_variance = 100 * np.cumsum(eigenvalues) / eigenvalues.sum()
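As a cross-check against the R output below, these eigenvalues should be the squares of the standard deviations reported by prcomp, since the eigenvalues of the correlation matrix are the variances of the principal components:

print(np.sqrt(eigenvalues))
# Should be approximately [1.8676 1.2850 1.0569 0.6518 0.4838 0.2582 0.1344],
# i.e. the "Standard deviations" line of the prcomp output further down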

The first 3 eigenvectors I get are:

  [[-0.48310951, -0.2554222 , -0.08049741],
   [-0.20665755, -0.60274754, -0.16463817],
   [ 0.31165047, -0.56586657,  0.02618752],
   [-0.5112334 , -0.00690029,  0.19728832],
   [ 0.21569808,  0.04569528,  0.83434948],
   [-0.43878073, -0.18261299,  0.4643702 ],
   [ 0.35546879, -0.46450701,  0.12258562]]

I get the same results (up to the signs of some columns) using R's prcomp:

> pc
Standard deviations:
[1] 1.8676341 1.2850439 1.0568965 0.6517665 0.4837756 0.2582021 0.1344177

Rotation:
            PC1          PC2         PC3        PC4        PC5         PC6         PC7
[1,] -0.4831095  0.255422196 -0.08049741  0.2324959 -0.1897195  0.76737470 -0.12638483
[2,] -0.2066575  0.602747536 -0.16463817 -0.6995409  0.2296628 -0.05498277  0.14750200
[3,]  0.3116505  0.565866565  0.02618752  0.1601325 -0.7081318 -0.22419271 -0.06802832
[4,] -0.5112334  0.006900293  0.19728832  0.2433956 -0.1357962 -0.28997811  0.73341720
[5,]  0.2156981 -0.045695275  0.83434948 -0.3293147 -0.1060965  0.33662557  0.14908327
[6,] -0.4387807  0.182612988  0.46437020  0.1413347  0.1812280 -0.38481239 -0.59798379
[7,]  0.3554688  0.464507012  0.12258562  0.4932350  0.5892964  0.11120226  0.19982732

The book then produces this table (Table 16-3: Communalities) and says: "The first step after computing PCA is to examine what proportion of variance is accounted for by the factor structure.
This is done by examining the communalities […]."

$$
\begin{array}{r|r|r}
& \mbox{Initial} & \mbox{Extraction} \\
\hline
\mbox{Reading} & 1.0 & 0.929 \\
\mbox{Music} &1.0 & 0.779 \\
\mbox{Arith.} &1.0 & 0.868 \\
\mbox{Verbal} &1.0 & 0.955 \\
\mbox{Sports} &1.0 & 0.943 \\
\mbox{Spelling} &1.0 & 0.967 \\
\mbox{Geometry} &1.0 & 0.814 \\
\end{array}
$$

It's not clear to me where these extraction communality values come from.
My understanding is that they are the sums of the squares of the loadings in a given row of the eigenvector matrix (over the first p columns, where p is the number of selected components).
According to the eigenvectors I get above, this would be $(-0.483)^2 + (-0.255)^2 + (-0.080)^2 = 0.305$ for "Reading" (first row) with 3 components, not the $0.929$ given in the book's table.
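For reference, this can be checked directly from the eigenvectors array computed above:

# Sum of squares of the "Reading" row over the first 3 eigenvector columns
print((eigenvectors[0, :3] ** 2).sum())  # ≈ 0.305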

The book shows a table (Table 16-7) called "unrotated component matrix", as follows:

$$
\begin{array}{r|r|r}
& Comp 1 & Comp 2 & Comp 3 \\
\hline
\mbox{Reading} & 0.902 & 0.328 & -0.085 \\
\mbox{Music} & 0.386 & … & … \\
\mbox{Arith.} & -0.582 & … & … \\
\mbox{Verbal} & 0.955 & … & … \\
\mbox{Sports} & -0.403 & … & … \\
\mbox{Spelling} & 0.819 & … & … \\
\mbox{Geometry} & -0.664 & … & … \\
\end{array}
$$

This table does explain the communality values ($(0.902)^2 + (0.328)^2 + (-0.085)^2 = 0.929$).

My understanding was that, when doing FA, the unrotated component matrix is the same as the matrix of eigenvectors obtained from the PCA.
From this example, that doesn't seem to be the case.

How should the "unrotated component matrix" be obtained and how does it differ from the PCA eigenvectors?

Best Answer

The component/factor matrix is the matrix of component/factor loadings; the term "loading" refers to that matrix, not to the eigenvector matrix. It is obtained from the eigenvector matrix by rescaling the latter's columns: the column sums of squares, which are all 1 in the eigenvector matrix, are brought up to the corresponding eigenvalues. That is, $a_{ij}=\sqrt{\lambda_j}\,u_{ij}$, where $a_{ij}$ is the loading, $u_{ij}$ is the element of the eigenvector matrix, and $\lambda_j$ is the eigenvalue.
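For illustration, here is a minimal NumPy sketch continuing from the eigenvalues and eigenvectors computed in the question; it rescales the eigenvector columns into loadings and then takes row sums of squared loadings over the first three components to recover the extraction communalities:

# Loadings: each eigenvector column scaled by the square root of its eigenvalue
loadings = eigenvectors * np.sqrt(eigenvalues)

# Extraction communalities for a 3-component solution:
# row sums of squared loadings over the first 3 columns
communalities = (loadings[:, :3] ** 2).sum(axis=1)
print(np.round(communalities, 3))
# Should match the "Extraction" column of Table 16-3, e.g. 0.929 for Reading

Whole columns of loadings may come out with flipped signs relative to the book's Table 16-7 (the usual sign indeterminacy of eigenvectors); the communalities are unaffected, since the loadings are squared.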