Solved – Should the correlation PCA projection be computed on original or normalized samples

data transformation, pca, standardization

Suppose we compute the correlation PCA of a dataset $X$ (with $m$ variables and $n$ observations) by first normalizing the input variables, i.e. shifting each variable to mean 0 and scaling it to standard deviation 1. Let us assume for the sake of this question that $\mu_i=0$ for our dataset. In that case we only need to normalize the standard deviation:

$$X'_{i,j}={X_{i,j}\over \sigma_i}$$

Once the correlation matrix $X'X'^T$ is computed (up to a constant factor of $1/(n-1)$, which does not change the eigenvectors), we calculate its SVD, which provides us with the eigenvectors $U$.

To rotate/transform the input points in accordance with the eigenvectors, we multiply them by $U^T$. My question now is: do we perform this on the original input samples ($X$) or on the normalized samples ($X'$)?
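For concreteness, the steps described above could be sketched in NumPy roughly as follows. This is only an illustration under assumptions that are not in the question: the data are synthetic, variables are stored in rows (so $X$ is $m \times n$), the $1/(n-1)$ factor is included so that the matrix has a unit diagonal, and the names `X_norm`, `proj_original` and `proj_normalized` are made up here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 500                            # m variables (rows), n observations (columns)

# Synthetic, correlated variables on very different scales (illustrative only).
A = np.array([[1.0, 0.0, 0.0],
              [0.6, 0.8, 0.0],
              [0.3, 0.3, 0.9]])
scales = np.array([[1.0], [10.0], [100.0]])
X = scales * (A @ rng.normal(size=(m, n)))
X = X - X.mean(axis=1, keepdims=True)    # enforce mu_i = 0, as the question assumes

sigma = X.std(axis=1, ddof=1, keepdims=True)
X_norm = X / sigma                       # X'_{i,j} = X_{i,j} / sigma_i

R = X_norm @ X_norm.T / (n - 1)          # correlation matrix, m x m
U, s, _ = np.linalg.svd(R)               # columns of U: eigenvectors of R

proj_original = U.T @ X                  # candidate 1: rotate the original samples
proj_normalized = U.T @ X_norm           # candidate 2: rotate the normalized samples
```

The two candidates differ only in whether each row was divided by $\sigma_i$ before the rotation.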

Best Answer

Use normalized variables.

PCA is an exploratory method: every analysis choice (such as to center or not, to standardize or not, to normalize to unit variance or in some other way, etc.) is possible and can perhaps make sense in some specific situation. No recommendation is absolute.

Nevertheless, let us think of a typical situation.

PCA on the correlation matrix is usually used when the variables are on different scales, because we believe that the normalized data cloud is a more meaningful representation of the dataset than the un-normalized cloud. If so, then it stands to reason to use the normalized data for the PCA projection, and not only for computing the PCA eigenvectors.

In addition, note that if you project un-normalized data on the eigenvectors of the normalized covariance matrix (i.e. correlation matrix), you will get correlated projections. In PCA we are used to uncorrelated principal components, so having correlated projections almost seems to defy the whole purpose of the method.
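The reason is that $U$ diagonalizes the correlation matrix $R$, whereas the covariance matrix of the un-normalized data is $\Sigma = D R D$ with $D = \mathrm{diag}(\sigma_1, \dots, \sigma_m)$, and $U^T \Sigma U$ is in general not diagonal. Here is a small numerical check of this; it is a sketch on synthetic two-variable data (the scale factor 50 and the mixing weights are arbitrary choices of mine), with variables stored in rows as in the question.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Two correlated variables on very different scales (synthetic, illustrative).
z = rng.normal(size=(2, n))
X = np.vstack([z[0], 50.0 * (0.6 * z[0] + 0.8 * z[1])])
X = X - X.mean(axis=1, keepdims=True)          # center each variable

sigma = X.std(axis=1, ddof=1, keepdims=True)
X_norm = X / sigma                             # unit-variance variables

R = np.corrcoef(X)                             # rows are variables by default
eigvals, U = np.linalg.eigh(R)                 # eigenvalues in ascending order
U = U[:, ::-1]                                 # reorder: largest eigenvalue first

cov_norm = np.cov(U.T @ X_norm)                # covariance of normalized projections
cov_orig = np.cov(U.T @ X)                     # covariance of un-normalized projections

print("normalized data, covariance of projections:\n", np.round(cov_norm, 3))
print("un-normalized data, covariance of projections:\n", np.round(cov_orig, 3))
print("correlation between un-normalized projections:",
      round(float(np.corrcoef(U.T @ X)[0, 1]), 3))
```

The first matrix should be diagonal up to rounding error, while the second has a large off-diagonal entry, because the rotation was derived for a differently scaled cloud.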

As an example, consider the wine dataset, comprising 178 wines of 3 different grape varieties measured on 13 variables. Left: 2D PCA projection using the covariance matrix. Middle: 2D PCA projection using the correlation matrix. Right: un-normalized (but centered) variables projected onto the correlation matrix eigenvectors.

[Figure: PCA covariance correlation]

It is pretty obvious that the middle projection is the most meaningful one, whereas the right one does not make a lot of sense at all.
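The three projections can be recomputed along these lines. This is a sketch rather than the code behind the figure: it assumes that scikit-learn's `load_wine` is the same 178 x 13 wine dataset, it stores observations in rows (the scikit-learn convention, transposed relative to the question), and the helper `top2` and the variable names are made up here. Plotting is left out; the script only reports how correlated the two projected coordinates are in each case.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data                     # shape (178, 13): observations in rows
Xc = X - X.mean(axis=0)                  # centered only
Xs = Xc / X.std(axis=0, ddof=1)          # centered and scaled to unit variance

def top2(M):
    """Return the two leading eigenvectors (as columns) of a symmetric matrix."""
    w, V = np.linalg.eigh(M)             # eigenvalues in ascending order
    return V[:, ::-1][:, :2]

U_cov = top2(np.cov(Xc, rowvar=False))       # covariance-matrix eigenvectors
U_cor = top2(np.corrcoef(Xc, rowvar=False))  # correlation-matrix eigenvectors

left = Xc @ U_cov        # covariance PCA
middle = Xs @ U_cor      # correlation PCA
right = Xc @ U_cor       # un-normalized data on correlation eigenvectors

for name, scores in [("left", left), ("middle", middle), ("right", right)]:
    r = np.corrcoef(scores, rowvar=False)[0, 1]
    print(f"{name:>6}: correlation between the two coordinates = {r:+.3f}")
```

Scatter-plotting the two columns of `left`, `middle` and `right`, colored by `load_wine().target`, should give panels comparable to the ones described above.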