Solved – Should we center the original data if we want to get the principal components

machine-learning, pca, python, scikit-learn

Suppose we have a data matrix $X$ of shape $n \times p$, where each row $x_i^T$ is a sample. By definition, the first principal component is $y_1 = e_1^T x$, where $e_1$ is the unit eigenvector corresponding to the largest eigenvalue of the sample covariance matrix. But when sklearn's PCA transforms $X$ to obtain the principal components, it centers $X$ inside transform first. Why?

import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)
X_new = X - X.mean(axis=0)   # subtract the column means
# transform() subtracts the fitted mean before projecting, so this holds:
np.allclose(pca.transform(X), X_new @ pca.components_.T)   # True

Best Answer

While it is true that your original data can be reconstructed from the principal components even if you didn't center the data when calculating them, part of what one is usually trying to do in principal components analysis is dimensionality reduction. That is, you want to find a subset of the principal components that captures most of the variation in the data. This happens when the variance of the coefficients of the principal components is small for all components after the first few. For that to happen, the centroid of the cloud of data points has to be at the origin, which is equivalent to centering the data.

Here's a 2D example to illustrate. Consider the following dataset:

[scatter plot: an elongated, nearly one-dimensional cloud of points offset from the origin]

This data is nearly one-dimensional, and would be well-represented by a single linear component. However, because the data does not pass through the origin, you can't describe it with a scalar multiplied by a single principal component vector (the set of scalar multiples of a single vector is a line through the origin). Centering the data translates this cloud of points so that its centroid is at the origin, making it possible to represent the line running down the middle of the cloud with a single principal component.
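To make that concrete, here is a minimal Python sketch (Python being the language of the question; the synthetic cloud m and the helper rank1_error are my own stand-ins for the pictured data) showing that a single component reconstructs the cloud well only after centering:

import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0, 10, size=200)
# a nearly one-dimensional cloud that does not pass through the origin
m = np.column_stack([5 + t, 3 + 2 * t + rng.normal(scale=0.1, size=200)])

def rank1_error(A):
    # reconstruction error when keeping only the first singular vector
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm(A - s[0] * np.outer(U[:, 0], Vt[0]))

print(rank1_error(m))                    # large: a line through the origin misses the cloud
print(rank1_error(m - m.mean(axis=0)))   # small: the centered cloud lies on a line through the origin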

You can see the difference if you try running the PCA with and without the centering. With centering:

> prcomp(m, center=TRUE)
Standard deviations (1, .., p=2):
[1] 2.46321136 0.04164508

Rotation (n x k) = (2 x 2):
         PC1        PC2
x -0.4484345 -0.8938157
y -0.8938157  0.4484345

The standard deviation of the second component (0.04) is much smaller than that of the first (2.46), indicating that most of the variation in the data is accounted for by the first component. We could reduce the dimensionality of the dataset from 2 to 1 by dropping the second component.
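In sklearn terms (the library from the question, which always centers during fit), dropping the second component looks roughly like the following sketch; m is assumed to be an n x 2 NumPy array holding this cloud of points:

from sklearn.decomposition import PCA

pca = PCA(n_components=1)              # keep only the first principal component
scores = pca.fit_transform(m)          # one coordinate per point along that component
m_rec = pca.inverse_transform(scores)  # map back to 2-D; close to m for this data
print(pca.explained_variance_ratio_)   # close to 1 here, so little is lost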

If, on the other hand, we don't center the data, we get a less useful result:

> prcomp(m, center=FALSE)
Standard deviations (1, .., p=2):
[1] 6.240952 1.065940

Rotation (n x k) = (2 x 2):
          PC1         PC2
x -0.04988157  0.99875514
y -0.99875514 -0.04988157

Here the standard deviation of the second component is still smaller than that of the first, but not nearly as much so as when we centered the data. We probably wouldn't get an adequate reconstruction of the data using just the first component and dropping the second. Thus, the uncentered version of the calculation is not useful for dimensionality reduction.
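If you want to reproduce this comparison in the question's Python setting rather than in R, a sketch along these lines should do; dividing the singular values by sqrt(n - 1) converts them into the component standard deviations that prcomp reports:

import numpy as np

def pc_stdevs(m, center=True):
    # standard deviations of the principal components, with or without centering
    A = m - m.mean(axis=0) if center else m
    s = np.linalg.svd(A, compute_uv=False)   # singular values of the (possibly centered) data
    return s / np.sqrt(len(m) - 1)

print(pc_stdevs(m, center=True))    # one large and one near-zero value
print(pc_stdevs(m, center=False))   # the second value is no longer negligible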
