Solved – How to use PCA to estimate the variance-covariance matrix

covariance-matrix, pca

I am working on alternative ways to estimate variance-covariance matrices. So far I have estimated the sample variance-covariance matrix and the single-index covariance matrix. I would also like to estimate the covariance matrix using principal component analysis (PCA). I have returns for 5 different asset types, and 6 factors that are assumed to drive these returns (inflation, interest rate, etc.).

Could you please guide me through the procedure for estimating this covariance matrix with PCA?

Best Answer

Quoting from the link in the above question, the methodology is as follows:

Principal component analysis (PCA) can be used to determine the underlying drivers of the stock returns. The PCA method transforms the vector space of N assets into another vector space of N factors by singular value decomposition (SVD) of the sample covariance matrix. Each factor, an eigenvector from the SVD, represents a linear combination of the original N assets, and the factors are uncorrelated by definition, with variances equal to the eigenvalues from the SVD.

Asset returns and sample covariance matrix can be written as

$$ R_i^e = \beta_{i,1}F_{1} + \beta_{i,2}F_{2} + \cdots + \beta_{i,N}F_{N} \\ \hat{\Sigma} = \beta D_{F} \beta^{T} $$

Where $\beta$ represents N columns of eigenvectors, and $D_F$ is the N by N diagonal matrix of eigenvalues.
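
The decomposition above can be checked numerically. The following is a minimal sketch on made-up data (the matrix `X` and its dimensions are assumptions for illustration, not part of the question): it verifies that the projected factors are uncorrelated with variances equal to the eigenvalues, and that $\beta D_F \beta^T$ reproduces the sample covariance.

```r
set.seed(1)
# Hypothetical data: 200 observations of 4 correlated return series
X <- matrix(rnorm(200 * 4), ncol = 4) %*% matrix(runif(16, -1, 1), ncol = 4)

S <- cov(X)
eig <- eigen(S, symmetric = TRUE)   # eig$vectors plays the role of beta, eig$values of diag(D_F)

# Factor realisations: centred returns projected onto the eigenvectors
factors <- scale(X, center = TRUE, scale = FALSE) %*% eig$vectors

# The factor covariance should be diagonal, with the eigenvalues on the diagonal
factor.cov <- cov(factors)
err.factors <- max(abs(factor.cov - diag(eig$values)))

# beta D_F beta^T should reproduce the sample covariance
err.recon <- max(abs(S - eig$vectors %*% diag(eig$values) %*% t(eig$vectors)))

c(err.factors, err.recon)           # both should be at machine-precision level
```

`prcomp` would give the same eigenvectors up to sign in `$rotation` and the eigenvalues in `$sdev^2`; `eigen` is used here only to keep the link to the formula explicit.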

PCA is often employed to reduce the dimensionality of the data. If the first L factors govern most of the variability of the asset returns, i.e. if $\frac{\sum_{l=1}^{L} \sigma_{F,l}^2}{\sum_{l=1}^{N} \sigma_{F,l}^2}$ is very close to 1, then the last N-L factors can be dropped,

$$ \hat{\Sigma} = \tilde{\beta}\tilde{D_F}\tilde{\beta}^T + D_{\epsilon} $$

Where $\tilde{\beta}$ is the N-asset by L-factor matrix of factor loadings (first L eigenvectors), $\tilde{D_F}$ is the L by L diagonal matrix of the first L eigenvalues, and $D_{\epsilon}$ is the N-asset by N-asset diagonal matrix of variances of idiosyncratic components not explained by the first L factors.
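
As a rough sketch of this truncation step, the code below picks L from the cumulative explained-variance ratio and then assembles $\tilde{\beta}\tilde{D_F}\tilde{\beta}^T + D_{\epsilon}$. The simulated data, the 0.80 cut-off and all variable names are assumptions chosen for illustration; taking $D_{\epsilon}$ as the diagonal of the residual $\hat{\Sigma} - \tilde{\beta}\tilde{D_F}\tilde{\beta}^T$ is one simple way to obtain the idiosyncratic variances, not necessarily the one used in the quoted source.

```r
set.seed(3)
# Hypothetical data: 8 assets driven by 1 common factor plus idiosyncratic noise
n.obs <- 250; n.assets <- 8
rets <- rnorm(n.obs) %*% t(runif(n.assets, 0.5, 1.5)) +
  matrix(rnorm(n.obs * n.assets, sd = 0.5), ncol = n.assets)

S <- cov(rets)
eig <- eigen(S, symmetric = TRUE)

# Cumulative share of total variance explained by the first L eigenvalues, L = 1..N
explained <- cumsum(eig$values) / sum(eig$values)
threshold <- 0.80                          # illustrative cut-off, not a rule
L <- which(explained >= threshold)[1]      # smallest L that reaches the cut-off

beta.L <- eig$vectors[, seq_len(L), drop = FALSE]   # N x L factor loadings
D.L <- diag(eig$values[seq_len(L)], nrow = L)       # L x L factor variances
systematic <- beta.L %*% D.L %*% t(beta.L)

# Idiosyncratic variances: each asset's variance not explained by the L factors
D.eps <- diag(diag(S - systematic))

Sigma.hat <- systematic + D.eps
```

With this choice of `D.eps`, the diagonal of `Sigma.hat` matches the sample variances exactly, while the off-diagonal entries come only from the retained factors.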

See Chapter 8 of Professor Jorion’s “Value At Risk” for more details.

This technique is often used when the number of assets N is close to the number of samples T, which leads to spurious correlations in the sample covariance, and when N > T, in which case the sample covariance matrix is singular.
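
The singularity for N > T is easy to see numerically. Here is a small sketch with made-up sizes (20 assets, 10 observations; both numbers are assumptions): it counts the non-negligible eigenvalues of the sample covariance.

```r
set.seed(4)
n.assets <- 20; n.obs <- 10            # hypothetical sizes with N > T
rets <- matrix(rnorm(n.obs * n.assets), ncol = n.assets)

S <- cov(rets)
eigs <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

# Centring the data removes one degree of freedom, so the rank is at most T - 1
rank.S <- sum(eigs > 1e-10 * max(eigs))
rank.S
```

Because the rank is below N, `solve(S)` cannot be relied on here, which is one reason a structured estimate such as the factor form above is preferred in this regime.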

Example

As a concrete example, here is an implementation in R for returns generated from the 1 factor model

$$ R_{t} = m_{t}\beta + \epsilon_{t} $$

where $R_{t}$ is an Nx1 vector of returns at time t, $m_t$ is the market return at time t, $\beta$ is the Nx1 vector of the assets' betas to the market return, and $\epsilon_{t}$ is Nx1 Gaussian noise at time t.

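set.seed(42)
N <- 15      # number of assets
T <- 30      # number of time samples (note: masks R's T shorthand for TRUE)
mvol <- 0.8  # market volatility

# Simulate the 1 factor model: betas, market returns and idiosyncratic noise
market.betas <- runif(N, 0, 2)
market.factor <- rnorm(T, 0, sd=mvol)
epsilon <- matrix(rnorm(N*T, 0, sd=1), ncol=N)

# T x N matrix of returns, its sample covariance and its principal components
equity.rets <- market.factor %*% t(market.betas) + epsilon
sample.cov <- cov(equity.rets)
prs <- prcomp(equity.rets)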

Keeping all the factors, we can reconstruct the sample covariance exactly (modulo machine precision):

sum(abs(sample.cov - prs$rotation %*% diag(prs$sdev^2) %*% t(prs$rotation)))
[1] 8.925881e-13

Alternatively, we can drop the PCs with the smallest variance; the answer "Relationship between SVD and PCA" discusses this in detail. Here we keep only the first PC, since we know the data come from a 1 factor model.

eigs <- prs$sdev^2
eigs[-1] <- 0

pca1.cov <- prs$rotation %*% diag(eigs) %*% t(prs$rotation)

Comparing the PCA covariance and the sample covariance to the model covariance of the factor component, $Var(m_{t}\beta)$ (the idiosyncratic noise variance is not part of this target), we can see improvements across a variety of distance metrics.

# Covariance implied by the market factor alone
model.cov <- mvol^2 * market.betas %*% t(market.betas)

# Entrywise distances between two matrices: L1, squared L2 and max-abs
d1 <- function(m1, m2){sum(abs(m1 - m2))}
d2 <- function(m1, m2){sum((m1 - m2)^2)}
dinf <- function(m1, m2){max(abs(m1 - m2))}
# Named `distances` rather than `dist` to avoid masking stats::dist
distances <- data.frame(
          c(d1(model.cov, sample.cov), d1(model.cov, pca1.cov)),
          c(d2(model.cov, sample.cov), d2(model.cov, pca1.cov)),
          c(dinf(model.cov, sample.cov), dinf(model.cov, pca1.cov))
)
colnames(distances) <- c("d1", "d2", "dinf")
rownames(distances) <- c("sample cov", "pca1 cov")
distances
                 d1       d2      dinf
sample cov 74.25255 42.26942 1.6620401
pca1 cov   52.13983 18.74075 0.8036362