Solved – Do you standardize the data before PCA whitening?

dataset, machine-learning, normalization, pca, sparse

I have a data set whose variables are on different scales, and some of the variables are sparse; for example,

n   V1  V2  V3      V4
0   0   1   34123   51523453
1   16  0   63124   34351234
2   0   0   63431   2343423
3   100 2   64351   34243
4   0   2   75283   35253523
5   0   1   2234    23423523
6   0   0   134523  315345
…   …   …   …       …   

Because of the sparsity, I think I need to reduce the data dimension.
Because of the different ranges, I would need to normalize the data.

To achieve these two goals, my original plan is to perform PCA whitening.

In the new decorrelated space, I would choose some eigenvectors associated with the first 2-3 largest eigenvalues as my principal vectors and reduce the dimension by projecting onto these vectors.

I think PCA whitening already normalizes the data to zero mean and unit variance.
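For concreteness, PCA whitening can be sketched in plain NumPy as follows. This is a minimal sketch on toy synthetic data (not the table above): center the data, eigendecompose the sample covariance, project onto the top eigenvectors, and rescale each component to unit variance.

```python
import numpy as np

# Toy correlated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)                 # center: zero-mean columns
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder so largest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                   # keep the top-2 components
Z = Xc @ eigvecs[:, :k] / np.sqrt(eigvals[:k])  # project, then whiten

print(Z.mean(axis=0))   # ~ [0, 0]
print(np.cov(Z.T))      # ~ 2x2 identity
```

After whitening, the retained components have zero mean and unit variance and are uncorrelated, which is the normalization the question refers to.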

I have two questions:

  1. Is it necessary to normalize the data (e.g., subtract the mean and divide by the standard deviation of each variable independently) before performing the whitening?

  2. What other normalization techniques are worth trying?

Thanks in advance!!

Best Answer

You should probably standardize your data before PCA.

PCA involves projecting the data onto the eigenvectors of the covariance matrix. If you don't standardize your data first, variables measured on large scales dominate the covariance matrix, so the leading eigenvalues are inflated and the corresponding eigenvectors point almost entirely along the high-variance variables. The eigenspace of the covariance matrix is then "stretched", leading to similarly "stretched" projections. See here for an example of this effect. This is not what you want. See also here for several good answers describing the geometry of PCA.

However, there are situations in which you do want to preserve the original variances. See here for discussion on that topic.

As for your follow-up question, of whether you will lose dependencies between variables if you standardize each variable independently: the answer is no. In fact, the correlation between un-standardized random variables is exactly the covariance of the standardized random variables.

Do note that covariance is inherently a measure of linear association. The covariance between a uniform random variable on $[-1, 1]$ and its square, for example, is exactly 0, even though the two are perfectly dependent. So higher-order relationships between variables could in fact be discarded by PCA. This is one motivation for kernel PCA.
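The uniform-variable example above can be checked by simulation. In this sketch, $X$ and $X^2$ are deterministically related, yet their sample covariance is near zero, so plain PCA cannot detect the relationship:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 1_000_000)   # X ~ Uniform[-1, 1]

# Cov(X, X^2) = E[X^3] - E[X] E[X^2] = 0 by symmetry, despite
# X^2 being a deterministic function of X.
print(np.cov(x, x**2)[0, 1])        # ~ 0
```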