Solved – the meaning of the variable “scores” in MATLAB’s PCA

autoencoders, machine learning, pca

I was trying to understand what the score variable was in MATLAB. The PCA documentation says:

Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.

What I find confusing is the following:

scores are the representations of X in the principal component space.

since I am not sure what that means precisely. For me (at least from an auto-encoding perspective) the representation of the data $X_N \in \mathbb{R}^{D \times N}$ in the principal component space would be the projection of all the data points (the columns of $X_N$) onto the column space of $U$, the eigenvectors of the covariance matrix $C_N = \frac{1}{N} \sum^{N}_{n=1} (x^{(n)} - \bar{x}) (x^{(n)} - \bar{x})^T = \frac{1}{N} (X-\bar{X})(X - \bar{X})^{T}$. Therefore, score should be the best linear combination of the principal components $U$.

For a single data vector $x^{(i)}$ one can notice the following:

$$ a^{(i)} = \left(
\begin{array}{c}
u^T_1 x^{(i)}\\
\vdots \\
u^T_k x^{(i)}\\
\vdots \\
u^T_K x^{(i)}
\end{array}
\right) = U^T x^{(i)}$$

produces the coefficients of the projections onto each principal component. Each component $a^{(i)}_k$ tells you how much the data point $x^{(i)}$ projects onto the direction of the eigenvector $u_k$. Thus one can reconstruct a single data point as follows:

$$\tilde{x}^{(i)} = \sum^{K}_{k=1} a^{(i)}_k u_k = U a^{(i)} = U U^T x^{(i)} $$

From the above it is not too hard to see that the following equation reconstructs the whole data matrix $X_N$:

$$ \tilde{X}_N = U U^T X_N$$

Therefore, to understand what the variable score actually represents, it occurred to me to compare it with the above equation. Thus, I wrote the following script that does exactly that:

D = 3;
N = 5;
X = rand(D, N); % data points are the columns of X
%% process data
x_mean = mean(X, 2); % mean of the data: x_mean = (1/N) * sum_i x^(i)
X_centered = X - repmat(x_mean, [1, N]);
%% PCA
[coeff, score, latent, ~, ~, mu] = pca(X'); % pca expects observations as rows; coeff = U
[U, S, V] = svd(X_centered); % U should equal coeff up to column signs
%% Reconstruct data
X_tilde_U = U * U' * X
X_tilde_coeff = coeff * coeff' * X
score % unfortunately not the same as the above matrices

Unfortunately, I discovered that score was not the same as $\tilde{X}_N$. So what is it? The points that I wanted to address were:

  1. What does score actually represent? What is a mathematical and intuitive explanation of what it is?
  2. If I want to use PCA as a tool to reconstruct vectors (or, say, images) as in a linear auto-encoder (a.k.a. PCA), should I use the variable score, or should I use what I understand as a reconstruction, $ \tilde{X}_N = U U^T X_N$?

After doing some more digging in the documentation I found that one can produce what I call a reconstruction with the following code:

X_tilde_score = (score * coeff' + repmat(mu, [N, 1]))';

which translates into equations as:

$$ \tilde{X} = (\text{score} \; U^T + \bar{X})^T$$

where $\bar{X}$ is the matrix whose columns are all the mean vector $\bar{x} = \frac{1}{N} \sum^N_{i=1} x^{(i)}$ (I am being loose about transposes here, since MATLAB's score stores observations as rows while my $X$ stores them as columns).

After some rearranging one can get:

$$ \text{scores} = U^T (\tilde{X} - \bar{X}) = U^T(X - \bar{X})$$
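
A quick numerical check of this relation, as a small sketch that assumes the variables from the script above are still in the workspace (note the transpose: MATLAB's score stores observations as rows, while my $X$ stores them as columns):

X_centered = X - repmat(x_mean, [1, N]);   % D x N centred data
score_manual = X_centered' * coeff;        % N x K, same layout as MATLAB's score
max(max(abs(score_manual - score)))        % should be of the order of machine precision
% using U from svd instead of coeff may flip the sign of some columns,
% since eigenvectors are only determined up to sign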

This relation seems a little weird to me because it is not what I would have called "representations of X in the principal component space". It does not even seem to be a projection, because it does not obey $P^2 = P$ (indeed $U^T U^T$ does not even make sense, since $U^T$ is rectangular). So I was wondering what the developers were thinking when they defined score. Why is returning such a thing preferable to returning $\tilde{X}$? Is there something about PCA that I don't know or don't understand, and hence why I miss the purpose of score? Why is it meaningful to define score that way? (I don't think they are "wrong" or that it is a bad definition; I genuinely want to understand the motivation for such a definition of score.)


If it helps to understand my perspective (and why I might be asking what may seem like an obvious question to some), I mostly come from a Machine Learning, Linear Algebra and Computer Science background. In particular, I find auto-encoders interesting right now.

Best Answer

This is a perfectly fine definition given the resources they cite (e.g. Jolliffe, 2002); it is at no point wrong. To your particular questions:

  1. By score they mean the projections $\Xi$ of the centred data onto the linear space defined by the eigenvectors $\Phi$. You can immediately check this in your script with something like: all(all( abs(abs(score) - abs(X_centered' * U)) < 2*eps )) (I use abs to avoid issues with sign, since the eigenvectors are only determined up to sign).

  2. You can produce the $K$-dimensional approximation of your centred data by using the scores of the first $K$ principal components. That is: $\hat{X}^K_{c} = \sum_{i=1}^K \phi_i \xi_i^T$. Taking all $K = D = 3$ components in your script this is plainly (coeff * score'), which numerically equals the centred sample: all(all( abs(X_centered - coeff * score') < 2*eps )) (see the sketch right after this list).
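
Here is a small sketch of such a truncated reconstruction, assuming the variables from your script are in the workspace; the choice K = 2 < D is purely for illustration, and mu returned by pca is a 1-by-D row vector:

K = 2;                                     % keep only the first K components
X_c_hat = coeff(:, 1:K) * score(:, 1:K)';  % D x N rank-K approximation of the centred data
X_hat = X_c_hat + repmat(mu', [1, N]);     % add the mean back to approximate X itself
% with K = D the approximation becomes exact (up to round-off):
% max(max(abs(coeff * score' - X_centered)))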

I believe that some of your misconception stems from your statement that "score should be the best linear combination of the principal components $U$"; unfortunately this is not the case. The scores dictate what the best linear combination of the principal components $U$ is for reconstructing the data in terms of fraction-of-variance-explained, but they are not the result of that combination. In terms of PCA, the SVD gives you only the left singular vectors $U$ (the eigenvectors of the covariance matrix of $X$) and the singular values $S$ (the square roots of the eigenvalues of the covariance matrix of $X$; more information here); it does not directly give you the scores $\Xi$. You will need to project the centred sample $X_c$ using $U$ to get the scores $\Xi$. Conversely, if you use $\Xi \Phi^T$ you can reconstruct the data back.
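
To make the link with the svd call in your script explicit, here is a small sketch (again assuming your script's variables; the columns may differ from MATLAB's score by a sign flip):

score_svd = X_centered' * U;               % N x K scores; equals V * S' since X_centered = U*S*V'
max(max(abs(abs(score_svd) - abs(score)))) % ~ machine precision, up to sign flips
X_c_back = U * score_svd';                 % reconstructs X_centered from the eigenvectors and the scores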

To recap: the scores are the projections of the centred data onto the linear space defined by the eigenvectors of the covariance matrix of $X$. This is exactly your final result: $\text{scores} = U^T (X - \bar{X})$.

A side-comment: when I started reading about PCA, I first tried to get the covariance derivation right and then moved on to the SVD. I believe the covariance methodology is a bit easier to follow and somewhat more intuitive, in terms of Statistics as well as physical interpretation. Maybe you want to nail that down first and then move to the SVD methodology.
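
For completeness, a minimal sketch of that covariance route, using the variables from your script (note that cov normalises by N-1 whereas your $C_N$ uses 1/N, which does not change the eigenvectors, and that eig does not guarantee any ordering, so the eigenvalues are sorted explicitly):

C = cov(X');                               % D x D sample covariance (observations as rows)
[V_eig, Lambda] = eig(C);                  % unsorted eigenvectors / eigenvalues
[lambda_sorted, idx] = sort(diag(Lambda), 'descend');
U_cov = V_eig(:, idx);                     % eigenvectors ordered by explained variance
max(max(abs(abs(U_cov) - abs(coeff))))     % matches pca's coeff up to column signs
% lambda_sorted matches the latent output returned by pca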
