Quoting from the link in the above question, the methodology is as follows:
principal component analysis (PCA) can be used to determine the underlying drivers of the stock returns. The PCA method transforms the vector space of N assets into another vector space of N factors by singular value decomposition (SVD) of the sample covariance matrix. Each factor, an eigenvector from the SVD, represents a linear combination of the original N assets, and the factors are uncorrelated by definition, with variances equal to the eigenvalues from the SVD.
The asset returns and the sample covariance matrix can be written as
$$
R_i^e = \beta_{i,1}F_{1} + \beta_{i,2}F_{2} + \cdots + \beta_{i,N}F_{N} \\
\hat{\Sigma} = \beta D_{F} \beta^{T}
$$
where $\beta$ is the N by N matrix whose columns are the eigenvectors, and $D_F$ is the N by N diagonal matrix of eigenvalues.
PCA is often employed to reduce the dimensionality of the data. If the first L factors govern most of the variability of the asset returns, i.e. if $\frac{\sum_{l=1}^{L} \sigma_{F,l}^2}{\sum_{l=1}^{N} \sigma_{F,l}^2}$ is very close to 1, then the last N-L factors can be dropped,
$$
\hat{\Sigma} = \tilde{\beta}\tilde{D_F}\tilde{\beta}^T + D_{\epsilon}
$$
where $\tilde{\beta}$ is the N-asset by L-factor matrix of factor loadings (the first L eigenvectors), $\tilde{D_F}$ is the L by L diagonal matrix of the first L eigenvalues, and $D_{\epsilon}$ is the N-asset by N-asset diagonal matrix of the variances of the idiosyncratic components not explained by the first L factors.
See Chapter 8 of Professor Jorion’s “Value At Risk” for more details.
This technique is often used when the number of assets N is close to the number of samples T, which leads to spurious correlations in the sample covariance matrix, and when N > T, in which case the sample covariance matrix is singular.
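To see the N > T degeneracy concretely, here is a minimal sketch (the dimensions are arbitrary): with more assets than observations, the centered data matrix has rank at most T-1, so the sample covariance matrix is singular.
set.seed(1)
N <- 40                                 # more assets...
T <- 20                                 # ...than observations
rets <- matrix(rnorm(N*T), nrow=T)      # T observations of N assets
S <- cov(rets)
qr(S)$rank                              # at most T-1 = 19 < N, so S is singular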
Example
As a concrete example, here is an implementation in R for returns generated from the one-factor model
$$
R_{t} = m_{t}\beta + \epsilon_{t}
$$
where $R_{t}$ is an Nx1 vector of returns at time t, $m_t$ is the market return at time t, $\beta$ is the Nx1 vector of asset betas to the market return, and $\epsilon_{t}$ is Nx1 Gaussian noise at time t.
set.seed(42)
N <- 15                                 # number of assets
T <- 30                                 # number of observations
mvol <- 0.8                             # market volatility
market.betas <- runif(N, 0, 2)          # asset betas to the market
market.factor <- rnorm(T, 0, sd=mvol)   # market returns
epsilon <- matrix(rnorm(N*T, 0, sd=1), ncol=N)               # idiosyncratic noise
equity.rets <- market.factor %*% t(market.betas) + epsilon   # T x N asset returns
sample.cov <- cov(equity.rets)
prs <- prcomp(equity.rets)              # PCA (SVD of the centered returns)
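As a quick check of the variance ratio $\frac{\sum_{l=1}^{L} \sigma_{F,l}^2}{\sum_{l=1}^{N} \sigma_{F,l}^2}$ discussed above, we can inspect the cumulative proportion of variance explained by the PCs; with one dominant factor, the first entry should already be large.
cumsum(prs$sdev^2) / sum(prs$sdev^2)    # cumulative variance explained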
Keeping all the factors, we can reconstruct the sample covariance matrix exactly (modulo machine precision):
sum(abs(sample.cov - prs$rotation %*% diag(prs$sdev^2) %*% t(prs$rotation)))
[1] 8.925881e-13
Or we can drop the PCs with lower variance. A detailed answer discussing this is Relationship between SVD and PCA. Here we keep only the first PC, with the omniscience that the data come from a one-factor model.
eigs <- prs$sdev^2                      # eigenvalues of the sample covariance
eigs[-1] <- 0                           # zero out all but the first eigenvalue
pca1.cov <- prs$rotation %*% diag(eigs) %*% t(prs$rotation)  # rank-1 PCA covariance
Comparing the PCA covariance and sample covariance to the model covariance, $Var(m_{t}\beta)$, we can see improvements across a variety of distance metrics.
model.cov <- mvol^2 * market.betas %*% t(market.betas)  # Var(m_t * beta)
d1 <- function(m1, m2){sum(abs(m1 - m2))}       # entrywise L1 distance
d2 <- function(m1, m2){sum((m1 - m2)^2)}        # squared Frobenius distance
dinf <- function(m1, m2){max(abs(m1 - m2))}     # entrywise max distance
dist <- data.frame(
c(d1(model.cov, sample.cov), d1(model.cov, pca1.cov)),
c(d2(model.cov, sample.cov), d2(model.cov, pca1.cov)),
c(dinf(model.cov, sample.cov), dinf(model.cov, pca1.cov))
)
colnames(dist) <- c("d1", "d2", "dinf")
rownames(dist) <- c("sample cov", "pca1 cov")
dist
                 d1       d2      dinf
sample cov 74.25255 42.26942 1.6620401
pca1 cov   52.13983 18.74075 0.8036362
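Note that pca1.cov is rank one. To recover the full truncated estimator $\tilde{\beta}\tilde{D_F}\tilde{\beta}^T + D_{\epsilon}$ from above, one could also restore the idiosyncratic variances from the diagonal of the residual; a minimal sketch using the variables already defined:
d.eps <- diag(diag(sample.cov - pca1.cov))  # diagonal matrix of idiosyncratic variances
pca1.full.cov <- pca1.cov + d.eps           # full-rank truncated estimator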
The paper you cited (Donoho et al. 2013, Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model) is an impressive piece of work which I confess I did not really study. Nevertheless, I believe it is easy to see that the answer to your question is negative: using any kind of shrinkage estimator of the covariance matrix will not improve your PCA results and, specifically, will not lead to a "better understanding of the structure in the data".
In a nutshell, this is because shrinkage estimators only affect the eigenvalues of the sample covariance matrix and not the eigenvectors.
Let me quote the beginning of the abstract of Donoho et al.:
Since the seminal work of Stein (1956) it has been understood that the empirical covariance matrix can be improved by shrinkage of the empirical eigenvalues. In this paper, we consider a proportional-growth asymptotic framework with $n$ observations and $p_n$ variables having limit $p_n/n \to \gamma \in (0,1]$. We assume the population covariance matrix $\Sigma$ follows the popular spiked covariance model, in which several eigenvalues are significantly larger than all the others, which all equal $1$. Factoring the empirical covariance matrix $S$ as $S = V \Lambda V'$ with $V$ orthogonal and $\Lambda$ diagonal, we consider shrinkers of the form $\hat{\Sigma} = \eta(S) = V \eta(\Lambda) V'$ where $\eta(\Lambda)_{ii} = \eta(\Lambda_{ii})$ is a scalar nonlinearity that operates individually on the diagonal entries of $\Lambda$.
The abstract goes on to describe the paper's contributions, but what is important for us here is that the sample covariance matrix $S$ and its shrunken version $\hat\Sigma$ have the same eigenvectors. Principal components are given by projections of the data onto these eigenvectors, so they will not be affected by the shrinkage.
The only thing that can be affected is the estimate of how much variance is explained by each PC, because these estimates are given by the eigenvalues. (And as @Aksakal wrote in the comments, this can affect the number of retained PCs.) But the PCs themselves will not change.
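Here is a minimal sketch of this point (the shrinker $\eta$ below is arbitrary, chosen only to be a monotone scalar nonlinearity of the kind the paper considers): applying $\eta$ to the eigenvalues and reconstructing leaves the eigenvectors, and hence the principal components, unchanged up to sign.
set.seed(1)
X <- matrix(rnorm(200), ncol=5)
S <- cov(X)
e <- eigen(S)
eta <- function(lambda) 0.5*lambda + 0.1        # an arbitrary monotone scalar shrinker
Sigma.hat <- e$vectors %*% diag(eta(e$values)) %*% t(e$vectors)
max(abs(abs(eigen(Sigma.hat)$vectors) - abs(e$vectors)))  # ~ 0 (machine precision)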
Best Answer
You will find a nice summary given by user @ttnphns here: https://stats.stackexchange.com/q/22520.
In particular:
In general, you should always center your data when performing PCA. As explained here, not centering your data can give misleading results.
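A small sketch illustrating that point (synthetic data with a deliberately large common mean): without centering, the leading direction found by prcomp points toward the data mean rather than along the direction of maximum variance.
set.seed(1)
X <- cbind(rnorm(100, mean=10, sd=1), rnorm(100, mean=10, sd=0.1))
prcomp(X, center=TRUE)$rotation[, 1]    # ~ +/-(1, 0): the high-variance direction
prcomp(X, center=FALSE)$rotation[, 1]   # ~ (0.7, 0.7): points toward the mean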