Solved – Princomp() outputs seemingly wrong PCA scores with cor=TRUE input argument


Given a 2D data frame d that is centered and scaled:

d = data.frame(x = c(1,2,3,4,5,6,7,8,9,10), y = c(3,5,6,8,3,9,3,5,7,15))
d = as.data.frame(scale(d,center=TRUE, scale = TRUE))

the correlation matrix and the covariance matrix are the same:

all.equal(cov(d),cor(d)) # this equals TRUE, meaning cov(d) and cor(d) are equal

Now if you use princomp to do PCA and compare the loadings/eigenvectors produced when you specify the covariance matrix (cor=FALSE) vs. the correlation matrix (cor=TRUE), you get the same loadings, as expected:

princomp(d, cor = TRUE)$loadings
princomp(d, cor = FALSE)$loadings

but the scores that are produced are not the same:

princomp(d, cor = TRUE)$scores
princomp(d, cor = FALSE)$scores

head(princomp(d, cor = TRUE)$scores,3)
#         Comp.1     Comp.2
# [1,] 1.7950077 -0.4206391
# [2,] 1.1445987 -0.5786822
# [3,] 0.6963027 -0.5346122

head(princomp(d, cor = FALSE)$scores,3)
#         Comp.1     Comp.2
# [1,] 1.7028938 -0.3990533
# [2,] 1.0858616 -0.5489861
# [3,] 0.6605707 -0.5071777

Why?

PCA scores are just the original data, d, multiplied by the eigenvectors (i.e. the loadings). d is the same in both calls to princomp, and the matrix being decomposed is the same since cov(d) = cor(d), so why the difference in scores?

Calculated manually, the original data times the eigenvectors matches the output when cor=FALSE:

head(as.matrix(d) %*% as.matrix(princomp(d)$loadings), 3)
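
As a quick check (a minimal sketch reusing d from above; all.equal ignores floating-point noise), the manual product reproduces the cor=FALSE scores:

manual <- as.matrix(d) %*% as.matrix(princomp(d)$loadings)
all.equal(manual, princomp(d, cor = FALSE)$scores, check.attributes = FALSE)  # TRUE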

Trying to understand the source code

To see what is happening, look at the code for the princomp() function by typing

getAnywhere("princomp.default")

I reconstructed the source code to see what happens when cor=TRUE vs. when cor=FALSE. Here is the source code, rearranged so you can compare the cor=TRUE path against the cor=FALSE path:

z = as.matrix(d)
sd_of_original_data = apply(d,2,sd)  # this is 1 because the data was scaled
covmat <- cov.wt(z)
n.obs <- covmat$n.obs
cv <- covmat$cov * (1 - 1/n.obs)
cen <- covmat$center

# if cor = TRUE the code does this, i.e. the CORRELATION MATRIX path
sds <- sqrt(diag(cv))
sds   # this is .9486833 for both columns
cv_cor <- cv/(sds %o% sds)
cv_cor
edc_cor <- eigen(cv_cor, symmetric = TRUE)
ev_cor <- edc_cor$values
sc_cor = sds
scr_cor = scale(z, center = cen, scale = sc_cor) %*% edc_cor$vectors
# reconstructed scores using source code match princomp output
scr_cor
princomp(d, cor = TRUE)$scores

# if cor = FALSE the code does this, i.e. the COVARIANCE MATRIX path
edc <- eigen(cv, symmetric = TRUE)
ev <- edc$values
sdev <- sqrt(ev)
sc = rep(1, ncol(cv))  # no rescaling in the covariance case
scr = scale(z, center=cen, scale=sc) %*% edc$vectors
# reconstructed scores using source code match princomp output
scr
princomp(d, cor = FALSE)$scores

First note that the source code has this rescaling

 cv <- covmat$cov * (1 - 1/n.obs)

and note covmat$cov has ones on the diagonal, i.e. diag(covmat$cov) is all 1s, but once you multiply by (1 - 1/n.obs) then diag(cv) != 1. This is the issue. Details below.
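
To see this concretely (a small check reusing d from above):

cw <- cov.wt(as.matrix(d))
diag(cw$cov)                       # 1 1: unbiased variances, divisor n - 1
diag(cw$cov * (1 - 1/cw$n.obs))    # 0.9 0.9: ML variances, divisor n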

  1. When cor=TRUE, cv (the covariance matrix) is divided elementwise by the outer product of the standard deviations, cv/(sds %o% sds). Note sds is sqrt(diag(cv)) and sds is 0.9486833 for both columns of data. sd_of_original_data is 1 for both columns because the original data was standardized. When cor=FALSE this division does not take place.

    Notice that even when the data is centered and scaled, diag(cv) will not be 1s, because of the earlier choice cv <- covmat$cov * (1 - 1/n.obs). So when cor=TRUE, sds = sqrt(diag(cv)) returns non-one values, whereas when cor=FALSE the scale vector is simply a vector of 1s. This sds vector, which is a function of cv <- covmat$cov * (1 - 1/n.obs), is the reason the scores are different, because below the scores are scaled by sds.

  2. When cor=TRUE, the scores are created by scaling the original data matrix, z, by sc_cor (which is sds) and then multiplying by the eigenvectors, edc_cor$vectors. Remember these eigenvectors were computed from cv_cor, which had the division from step 1.

    sc_cor = sds
    scr_cor = scale(z, center=cen, scale=sc_cor) %*% edc_cor$vectors
    

    vs. when cor=FALSE the scores are created by scaling the original data matrix by 1 and multiplying by the eigenvectors, edc$vectors, of cv. This seems to be the textbook PCA score calculation.

    sc = rep(1, ncol(cv))  # no rescaling in the covariance case
    scr = scale(z, center = cen, scale = sc) %*% edc$vectors
    

    So initially I thought that if cor(d) = cov(d), the PCA loadings and scores would be the same. But the covariance matrix you think princomp is using is sometimes not the one actually used; it is modified first.

    It seems that, to avoid all of this confusion, you can always pass the covariance matrix you want to use in the "covmat" parameter to princomp. Then you have control over the covariance matrix used in the PCA. If you don't want cv scaled by (1 - 1/n.obs), just pass the proper cv in "covmat" (see the sketch just below this list).

    Notice there are different ways to scale the covariance matrix. Here is the relevant passage from the help for the cov.wt() used in the source code:

    By default, method = "unbiased": the covariance matrix is divided by one minus the sum of squares of the weights, so if the weights are the default (1/n) the conventional unbiased estimate of the covariance matrix with divisor (n - 1) is obtained. This differs from the behaviour in S-PLUS, which corresponds to method = "ML" and does not divide.
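
Here is a minimal sketch of both points, reusing d from above. cov.wt() exposes the two divisors directly, and (if I read the source correctly) supplying a covariance list as returned by cov.wt() in the covmat argument makes princomp() use that covariance as-is, with no extra (1 - 1/n.obs) rescaling:

cov.wt(d, method = "unbiased")$cov   # divisor n - 1 (the default)
cov.wt(d, method = "ML")$cov         # divisor n, i.e. cov(d) * (1 - 1/n)

p <- princomp(d, covmat = cov.wt(d)) # the list supplies cov, center and n.obs
p$sdev^2                             # eigenvalues of the unbiased cov(d)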

Also a question. The source code rescales the covariance matrix:

 cv <- covmat$cov * (1 - 1/n.obs)

but I am confused: if you look at getAnywhere("cov.wt"), it is already scaling for the weights.

Here is the data, the result from cov.wt(), and the source code used to calculate cov.wt() (see getAnywhere("cov.wt")):

d = data.frame(x = c(1,2,3,4,5,6,7,8,9,10), y=c(3,5,6,8,3,9,3,5,7,15))
d = as.data.frame(scale(d,center=TRUE, scale=TRUE))
cov.wt(d)$cov
##### here is what cov.wt is doing see getAnywhere("cov.wt")
wt = rep(.1,10)
center = colSums(wt * d)
x <- sqrt(wt) * sweep(d, 2, center, check.margin=FALSE)
crossprod(as.matrix(x))/(1 - sum( wt^2 ))  # this is cov.wt(d)$cov
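
As a quick check, reusing the wt and x computed above, the reconstruction matches cov.wt()'s output exactly:

all.equal(crossprod(as.matrix(x)) / (1 - sum(wt^2)), cov.wt(d)$cov)  # TRUE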

See the denominator in the last line, (1 - sum(wt^2)). cov.wt() is already scaling for the weights, so why does princomp() need to scale the result again by (1 - 1/n.obs)?

See the "Weighted Samples" section here:
https://en.wikipedia.org/wiki/Sample_mean_and_covariance

"If all weights are the same …. the weighted mean and covariance reduce to the sample mean and covariance above."

Essentially, the (1 - sum(wt^2)) term in the source code of cov.wt() is the same as the (1 - 1/n.obs) term used to multiply cov.wt()'s output in the source code of princomp(), because with equal weights of 1/n we have sum(wt^2) == 1/n.obs. So are they scaling twice?
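
A one-line numeric check of that claim, with n = 10 equal weights:

wt <- rep(1/10, 10)
c(1 - sum(wt^2), 1 - 1/10)   # both equal 0.9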

Best Answer

I do not recommend using princomp() because it is a source of constant confusion. Use prcomp() instead.

The princomp() function computes the covariance matrix using the $1/n$ factor, instead of the more standard $1/(n-1)$ factor, which means that the eigenvalues returned by princomp and by any other R function (such as prcomp or eigen(cov())) will, very confusingly, differ: Why do the R functions 'princomp' and 'prcomp' give different eigenvalues?
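
A quick illustration using the d from the question ($n=10$); the eigenvalues differ by exactly the factor $(n-1)/n$:

 princomp(d)$sdev^2                      # eigenvalues with divisor n
 prcomp(d)$sdev^2                        # eigenvalues with divisor n - 1
 princomp(d)$sdev^2 / prcomp(d)$sdev^2   # both components give 0.9 = 9/10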

Your issue is closely related to that (as correctly identified by @whuber in the comments). When princomp is run with the cor=FALSE input argument (which is the default), it computes eigenvectors of the covariance matrix, centers the data, and projects the data onto the eigenvectors. When run with cor=TRUE, the function computes eigenvectors of the correlation matrix, $z$-scores the data using variances computed with the $1/n$ factor, and projects the data onto the eigenvectors.
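
Here is a compact sketch of the two pipelines, reusing d from the question; up to eigenvector sign conventions it reproduces princomp's scores:

 x <- as.matrix(d); n <- nrow(x)

 # cor = FALSE: eigenvectors of the ML covariance; center the data; project
 V <- eigen(cov(x) * (1 - 1/n), symmetric = TRUE)$vectors
 head(scale(x, center = TRUE, scale = FALSE) %*% V, 3)

 # cor = TRUE: eigenvectors of the correlation matrix;
 # z-score with 1/n variances; project
 V <- eigen(cor(x), symmetric = TRUE)$vectors
 sd_ml <- apply(x, 2, sd) * sqrt((n - 1)/n)
 head(scale(x, center = TRUE, scale = sd_ml) %*% V, 3)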

This means that PCA scores produced by the following two lines will be slightly different (because scale uses $n-1$):

 princomp(scale(data))$scores
 princomp(data, cor=T)$scores
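
On the d from the question ($n=10$), this difference is exactly a constant factor of $\sqrt{n/(n-1)}$, as this check shows:

 s1 <- princomp(scale(d))$scores
 s2 <- princomp(d, cor = TRUE)$scores
 all.equal(s2, s1 * sqrt(10/9))   # TRUE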

I find this counter-intuitive and confusing. As @ttnphns points out in the comments, using $n$ is not necessarily worse than using $n-1$. But princomp's choice is inconsistent with almost all other R functions, leading to great confusion.

If you supply already $z$-scored data and specify cor=TRUE, the function will $z$-score the data again, according to its own liking, dividing each column by its standard deviation computed with the $1/n$ factor, i.e. by $\sqrt{(n-1)/n}$; equivalently, each column gets multiplied by $\sqrt{n/(n-1)}$. This is the difference that you observe.

(See here about an unrelated algorithmic difference between princomp and prcomp.)


Some comments on the source code

Let us take a look at the code that you provided. The function starts by computing the covariance matrix and then multiplying it by (1 - 1/n.obs). You are over-thinking this; it simply converts the $1/(n-1)$ factor into the $1/n$ factor, because $1-1/n=(n-1)/n$. So okay, the covariance matrix is now scaled to be the maximum likelihood estimate (instead of the unbiased one).

When cor=FALSE (which is the default), the scores are computed as

 scr =  scale(z, center = cen, scale = sc) %*% edc$vectors

which is the centered data (not scaled: sc is equal to 1) multiplied by the eigenvectors.

When cor=TRUE, the scores are computed with the same formula, but now the sc variable is set to sqrt(diag(cv)), where cv is the ML covariance matrix. As you wrote yourself, in your case sc = .9486833, which is $\sqrt{9/10}$. So even though you $z$-scored your data, princomp re-$z$-scores it for you, because it prefers a different denominator. Hence the confusion.
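
This value is easy to reproduce on the question's data, following the same steps as the source:

 n  <- nrow(d)
 cv <- cov.wt(as.matrix(d))$cov * (1 - 1/n)   # the ML covariance matrix
 sqrt(diag(cv))                               # 0.9486833 for both columns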
