Solved – the meaning of the variable “scores” in MATLAB’s PCA

autoencoders, machine learning, pca

I was trying to understand what the score variable was in MATLAB. The PCA documentation says:

Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.

What I find confusing is the following:

scores are the representations of X in the principal component space.

since I am not sure what that means precisely. For me (at least from an auto-encoding perspective) the representation of the data $X_N \in \mathbb{R}^{D \times N}$ in the principal component space would be the projection of all the data points (the columns of $X_N$) onto the column space of $U$, the eigenvectors of the covariance matrix $C_N = \frac{1}{N} \sum^{N}_{n=1} (x^{(n)} - \bar{x}) (x^{(n)} - \bar{x})^T = \frac{1}{N} (X-\bar{X})(X - \bar{X})^{T}$. Therefore, score should be the best linear combination of the principal components $U$.

For a single data vector $x^{(i)}$ one can notice the following:

$$ a^{(i)} = \left(
\begin{array}{c}
u^T_1 x^{(i)}\\
\vdots \\
u^T_k x^{(i)}\\
\vdots \\
u^T_K x^{(i)}
\end{array}
\right) = U^T x^{(i)}$$

produces the coefficients of the projections onto each principal component. Each component $a^{(i)}_k$ tells you how much the data point $x^{(i)}$ projects onto the direction of the eigenvector $u_k$. Thus one can reconstruct a single data point as follows:

$$\tilde{x}^{(i)} = \sum^{K}_{k=1} a^{(i)}_k u_k = U a^{(i)} = U U^T x^{(i)} $$

From the above it is not too hard to see that the following equation reconstructs the whole data matrix $X_N$:

$$ \tilde{X}_N = U U^T X_N$$

Therefore, to understand what the variable score actually represents, it occurred to me to compare it with the above equation. Thus, I wrote the following script that does exactly that:

D = 3;
N = 5;
X = rand(D, N); % data points are the columns of X
%% process data
x_mean = mean(X, 2); % mean of the data: x_mean = (1/N) * sum_i x^(i)
X_centered = X - repmat(x_mean, [1, N]);
%% PCA
[coeff, score, latent, ~, ~, mu] = pca(X'); % pca expects observations as rows; coeff = U
[U, S, V] = svd(X_centered); % U should equal coeff up to column signs
%% Reconstruct data
X_tilde_U = U * U' * X
X_tilde_coeff = coeff * coeff' * X
score % unfortunately not the same as the above matrices

Unfortunately, I discovered that score was not the same as $\tilde{X}_N$. So what is it? The points that I wanted to address were:

  1. What does score actually represent? What is a mathematical and intuitive explanation of what it is?
  2. If I want to use PCA as a tool to reconstruct vectors (or, say, images) as in a linear auto-encoder (a.k.a. PCA), should I use the variable score, or should I use what I understand as a reconstruction, $ \tilde{X}_N = U U^T X_N$?

After doing some more digging in the documentation I found that one can produce what I call a reconstruction with the following code:

X_tilde_score = (score * coeff' + repmat(mu, [N, 1]))';

which translates into equations as:

$$ \tilde{X} = (\text{score} \; U^T + \bar{X})^T$$

where $\bar{X}$ is the matrix whose columns are all the mean vector $\bar{x} = \frac{1}{N} \sum^N_{i=1} x^{(i)}$ (I am being loose about transposes here, since MATLAB's score stores observations as rows while my $X$ stores them as columns).

After some rearranging one can get:

$$ \text{scores} = U^T (\tilde{X} - \bar{X}) = U^T(X - \bar{X})$$
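
A quick numerical check of this relation, as a small sketch that assumes the variables from the script above are still in the workspace (note the transpose: MATLAB's score stores observations as rows, while my $X$ stores them as columns):

X_centered = X - repmat(x_mean, [1, N]);   % D x N centred data
score_manual = X_centered' * coeff;        % N x K, same layout as MATLAB's score
max(max(abs(score_manual - score)))        % should be of the order of machine precision
% using U from svd instead of coeff may flip the sign of some columns,
% since eigenvectors are only determined up to sign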

This relation seems a little weird to me because it is not what I would have called "representations of X in the principal component space". It does not even seem to be a projection, because it does not obey $P^2 = P$ (indeed $U^T U^T$ does not even make sense, since $U^T$ is rectangular). So I was wondering what the developers were thinking when they defined score. Why is returning such a thing preferable to returning $\tilde{X}$? Is there something about PCA that I don't know or don't understand, and hence why I miss the purpose of score? Why is it meaningful to define score that way? (I don't think they are "wrong" or that it is a bad definition; I genuinely want to understand the motivation for such a definition of score.)


If it helps to understand my perspective (and why I might be asking what may seem like an obvious question to some), I mostly come from a Machine Learning, Linear Algebra and Computer Science background. In particular, I find auto-encoders interesting right now.

Best Answer

This is a perfectly fine definition given the resources they cite (e.g. Jolliffe, 2002); it is at no point wrong. To your particular questions:

  1. By score they mean the projections $\Xi$ of the centred data onto the linear space defined by the eigenvectors $\Phi$. You can immediately check this in your script with something like: all(all( abs(abs(score) - abs(X_centered' * U)) < 2*eps )) (I use abs to avoid issues with sign, since the eigenvectors are only determined up to sign).

  2. You can produce the $K$-dimensional approximation of your centred data by using the scores of the first $K$ principal components. That is: $\hat{X}^K_{c} = \sum_{i=1}^K \phi_i \xi_i^T$. Taking all $K = D = 3$ components in your script this is plainly (coeff * score'), which numerically equals the centred sample: all(all( abs(X_centered - coeff * score') < 2*eps )) (see the sketch right after this list).
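
Here is a small sketch of such a truncated reconstruction, assuming the variables from your script are in the workspace; the choice K = 2 < D is purely for illustration, and mu returned by pca is a 1-by-D row vector:

K = 2;                                     % keep only the first K components
X_c_hat = coeff(:, 1:K) * score(:, 1:K)';  % D x N rank-K approximation of the centred data
X_hat = X_c_hat + repmat(mu', [1, N]);     % add the mean back to approximate X itself
% with K = D the approximation becomes exact (up to round-off):
% max(max(abs(coeff * score' - X_centered)))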

I believe that some of your misconception stems from your statement that "score should be the best linear combination of the principal components $U$"; unfortunately this is not the case. The scores dictate what the best linear combination of the principal components $U$ is for reconstructing the data in terms of fraction-of-variance-explained, but they are not the result of that combination. In terms of PCA, the SVD gives you only the left singular vectors $U$ (the eigenvectors of the covariance matrix of $X$) and the singular values $S$ (the square roots of the eigenvalues of the covariance matrix of $X$; more information here); it does not directly give you the scores $\Xi$. You will need to project the centred sample $X_c$ using $U$ to get the scores $\Xi$. Conversely, if you use $\Xi \Phi^T$ you can reconstruct the data back.
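
To make the link with the svd call in your script explicit, here is a small sketch (again assuming your script's variables; the columns may differ from MATLAB's score by a sign flip):

score_svd = X_centered' * U;               % N x K scores; equals V * S' since X_centered = U*S*V'
max(max(abs(abs(score_svd) - abs(score)))) % ~ machine precision, up to sign flips
X_c_back = U * score_svd';                 % reconstructs X_centered from the eigenvectors and the scores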

To recap: the scores are the projections of the centred data onto the linear space defined by the eigenvectors of the covariance matrix of $X$. This is exactly your final result: $\text{scores} = U^T (X - \bar{X})$.

A side-comment: when I started reading about PCA, I first tried to get the covariance derivation right and then moved on to the SVD. I believe the covariance methodology is a bit easier to follow and somewhat more intuitive, in terms of Statistics as well as physical interpretation. Maybe you want to nail that down first and then move to the SVD methodology.
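
For completeness, a minimal sketch of that covariance route, using the variables from your script (note that cov normalises by N-1 whereas your $C_N$ uses 1/N, which does not change the eigenvectors, and that eig does not guarantee any ordering, so the eigenvalues are sorted explicitly):

C = cov(X');                               % D x D sample covariance (observations as rows)
[V_eig, Lambda] = eig(C);                  % unsorted eigenvectors / eigenvalues
[lambda_sorted, idx] = sort(diag(Lambda), 'descend');
U_cov = V_eig(:, idx);                     % eigenvectors ordered by explained variance
max(max(abs(abs(U_cov) - abs(coeff))))     % matches pca's coeff up to column signs
% lambda_sorted matches the latent output returned by pca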
