How does centering make a difference in PCA (for SVD and eigen decomposition)?

Tags: centering, eigenvalues, pca, r, svd

What difference does centering (or de-meaning) your data make for PCA? I've heard that it makes the maths easier or that it prevents the first PC from being dominated by the variables' means, but I feel like I haven't been able to firmly grasp the concept yet.

For example, the top answer to How does centering the data get rid of the intercept in regression and PCA? describes how not centering would pull the first PC through the origin rather than along the main axis of the point cloud. Based on my understanding of how the PCs are obtained from the eigenvectors of the covariance matrix, I can't see why this would happen.

Moreover, my own calculations with and without centering seem to make little sense.

Consider the setosa flowers in the iris dataset in R. I calculated the eigenvectors and eigenvalues of the sample covariance matrix as follows.

data(iris)
df <- iris[iris$Species=='setosa',1:4]
e <- eigen(cov(df))
> e
$values
[1] 0.236455690 0.036918732 0.026796399 0.009033261

$vectors
            [,1]       [,2]       [,3]        [,4]
[1,] -0.66907840  0.5978840  0.4399628 -0.03607712
[2,] -0.73414783 -0.6206734 -0.2746075 -0.01955027
[3,] -0.09654390  0.4900556 -0.8324495 -0.23990129
[4,] -0.06356359  0.1309379 -0.1950675  0.96992969

If I center the dataset first, I get exactly the same results. This seems quite obvious, since centering does not change the covariance matrix at all.

df.centered <- scale(df, scale = FALSE, center = TRUE)
e.centered <- eigen(cov(df.centered))
e.centered
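A quick numerical check confirms this (a minimal sketch, reusing the objects defined above):

all.equal(cov(df), cov(df.centered), check.attributes = FALSE)  # TRUE: same covariance matrix
all.equal(e$values, e.centered$values)                          # TRUE: same eigenvalues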

The prcomp function returns the same decomposition as well (its reported standard deviations are the square roots of the eigenvalues above), for both the centered and the uncentered dataset.

p<-prcomp(df)
p.centered <- prcomp(df.centered)
Standard deviations:
[1] 0.48626710 0.19214248 0.16369606 0.09504347

Rotation:
                     PC1        PC2        PC3         PC4
Sepal.Length -0.66907840  0.5978840  0.4399628 -0.03607712
Sepal.Width  -0.73414783 -0.6206734 -0.2746075 -0.01955027
Petal.Length -0.09654390  0.4900556 -0.8324495 -0.23990129
Petal.Width  -0.06356359  0.1309379 -0.1950675  0.96992969
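This correspondence is easy to verify (a small sketch, assuming the objects p and e defined above; eigenvector signs are arbitrary, hence the abs):

all.equal(p$sdev^2, e$values)                                         # TRUE
all.equal(abs(p$rotation), abs(e$vectors), check.attributes = FALSE)  # TRUE, up to sign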

However, the prcomp function has the default option center = TRUE. Disabling this option results in the following PCs for the uncentered data (p.centered remains the same when center is set to FALSE, since that data is already centered):

p.uncentered <- prcomp(df, center = FALSE)
> p.uncentered
Standard deviations:
[1] 6.32674700 0.22455945 0.16369617 0.09766703

Rotation:
                    PC1         PC2        PC3         PC4
Sepal.Length -0.8010073  0.40303704  0.4410167  0.03811461
Sepal.Width  -0.5498408 -0.78739486 -0.2753323 -0.04331888
Petal.Length -0.2334487  0.46456598 -0.8317440 -0.19463332
Petal.Width  -0.0395488  0.04182015 -0.1946750  0.97917752

Why is this different from my own eigenvector calculations on the covariance matrix of the uncentered data? Does it have to do with how the decomposition is computed? I've seen it mentioned that prcomp uses the SVD method rather than the eigenvalue decomposition to calculate the PCs. The function princomp uses the latter, yet its results are identical to those of prcomp. Does my issue relate to the answer I described at the top of this post?

EDIT: The issue was cleared up by the helpful @ttnphns. See his comment on this question: What does it mean to compute eigenvectors of a covariance matrix if the data were not centered first? and his answer here: https://stats.stackexchange.com/a/22520/3277. In short: a covariance matrix already implicitly involves centering of the data. PCA uses either SVD or eigendecomposition of the centered data $\mathbf X$, and the covariance matrix is then equal to $\mathbf X^\top \mathbf X/(n-1)$.
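That identity is straightforward to check in R (a minimal sketch; X below is just the centered setosa data from above):

X <- scale(df, center = TRUE, scale = FALSE)                                # center the data
all.equal(cov(df), crossprod(X) / (nrow(X) - 1), check.attributes = FALSE)  # TRUE; crossprod(X) is X'X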

Best Answer

As you remarked yourself and as explained by @ttnphns in the comments, computing the covariance matrix implicitly performs centering: variance, by definition, is the average squared deviation from the mean. Centered and non-centered data therefore have identical covariance matrices. So if by PCA we understand the following procedure: $$\mathrm{Data}\to\text{Covariance matrix}\to\text{Eigen-decomposition},$$ then centering does not make any difference. Compare this with how Wikipedia describes PCA:

[Wikipedia:] To find the axes of the ellipse, we must first subtract the mean of each variable from the dataset to center the data around the origin. Then, we compute the covariance matrix of the data...

And so you are right to observe that this is not a very accurate formulation: the explicit centering step there is redundant, since computing the covariance matrix centers the data anyway.

When people talk about "PCA on non-centered data", they mean that the eigen-decomposition is performed not on the covariance matrix but on the matrix $\mathbf X^\top \mathbf X/(n-1)$. If $\mathbf X$ is centered, then this is exactly the covariance matrix; if not, then it is not. So if by PCA we understand the following procedure:

$$\text{Data } \mathbf X\to\text{Matrix } \mathbf X^\top \mathbf X/(n-1)\to\text{Eigen-decomposition},$$

then centering matters a lot and has the effect described and illustrated by @ttnphns in How does centering the data get rid of the intercept in regression and PCA?
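In R this "non-centered PCA" can be reproduced by hand, and it matches the prcomp(df, center = FALSE) output from the question (a sketch; eigenvector signs are arbitrary, so they may come out flipped):

X <- as.matrix(df)                            # NOT centered
e.unc <- eigen(crossprod(X) / (nrow(X) - 1))  # eigen-decomposition of X'X/(n-1)
sqrt(e.unc$values)  # matches p.uncentered's standard deviations
e.unc$vectors       # matches p.uncentered's rotation, up to sign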

It might seem weird to even mention this "strange" procedure; however, consider that PCA can be very conveniently performed via singular value decomposition (SVD) of the data matrix $\mathbf X$ itself. I describe this in detail in Relationship between SVD and PCA. How to use SVD to perform PCA? In this case the procedure is as follows:

$$\text{Data } \mathbf X \to \text{Singular value decomposition}.$$

If $\mathbf X$ is centered, then this is equivalent to standard PCA done via the covariance matrix. But if not, then it is equivalent to the "non-centered" PCA described above. Since SVD is a very common and very convenient way to perform PCA, in practice it can be quite important to remember to center the data before calling the svd function. I certainly had my share of bugs because of forgetting to do it.
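Concretely, the SVD route looks like this in R (a sketch using the setosa data df from the question; the right singular vectors are the principal axes, and the singular values divided by $\sqrt{n-1}$ give prcomp's standard deviations):

Xc <- scale(df, center = TRUE, scale = FALSE)  # center before the SVD
s <- svd(Xc)
s$d / sqrt(nrow(Xc) - 1)  # equals prcomp(df)$sdev
s$v                       # equals prcomp(df)$rotation, up to sign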
