Correlation – How Are Eigenvalues/Singular Values Related to Variance (SVD/PCA)

Tags: correlation, pca, svd

Let $X$ be a data matrix of size $n \times p$.

Assume that $X$ is centered (column means subtracted).
Then, the $p \times p$ covariance matrix is given by $$C = \frac{X^TX}{n-1}$$

Since $C$ is symmetric, it is orthogonally diagonalizable; hence there exists an orthogonal matrix $V$ such that $$C = VLV^T,$$ where the columns of $V$ are eigenvectors of $C$ and $L$ is diagonal with the eigenvalues $\lambda_i$ of $C$ on its diagonal.
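(For concreteness, here is a minimal numpy sketch of this setup with synthetic data; the variable names are just illustrative. It also checks the singular-value connection from the title: the singular values $s_i$ of the centered $X$ satisfy $\lambda_i = s_i^2/(n-1)$.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic data
X = X - X.mean(axis=0)                                   # center the columns

C = X.T @ X / (n - 1)                                    # p x p covariance matrix

# Eigendecomposition C = V L V^T (eigh returns eigenvalues in ascending order)
eigvals, V = np.linalg.eigh(C)
assert np.allclose(C, V @ np.diag(eigvals) @ V.T)

# SVD of the centered data: its singular values give the same eigenvalues
s = np.linalg.svd(X, compute_uv=False)
assert np.allclose(np.sort(s**2 / (n - 1)), np.sort(eigvals))
```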

Now, I've read statements along the lines of: the eigenvalues $\lambda_i$ give the variances of the respective principal components (PCs).
What does this mean? How is the spread/variance of a covariate related to the eigenvalue?

I understand that we want components with large variance, since larger variance (generally) means more information, but how does this relate to the eigenvalues?

Best Answer

The variance associated with any $p$-vector $x$ (that is, the variance of the linear combination $Xx$ of the centered columns of $X$) is given by

$$\operatorname{Var}(x) = x^\prime C x.\tag{1}$$
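(A concrete check, continuing the numerical sketch from the question: the quadratic form $x^\prime C x$ equals the sample variance of the projected data $Xx$. The data here are synthetic and the names are illustrative only.)

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)              # centered data matrix
C = X.T @ X / (n - 1)               # covariance matrix

x = rng.normal(size=p)              # any p-vector
proj = X @ x                        # the linear combination Xx (one value per row of X)

# x' C x equals the sample variance of the projection (mean of proj is 0 by centering)
assert np.isclose(x @ C @ x, proj.var(ddof=1))
```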

We may write $x^\prime$ as a linear combination of the rows $v_1,$ $v_2,\ldots,$ $v_p$ of $V^\prime$ (the transposed columns of $V,$ i.e. the eigenvectors of $C$), because

$$x^\prime = x^\prime\mathbb{I} = x^\prime V V^\prime = (x^\prime V)_1v_1 + (x^\prime V)_2v_2 + \cdots + (x^\prime V)_pv_p.$$

The coefficient of $v_i$ in this linear combination is $(x^\prime V)_i = (V^\prime x)_i.$

The diagonalization $C = V\Lambda V^\prime$ (writing $\Lambda$ for the diagonal eigenvalue matrix called $L$ in the question) permits you to rewrite these relations more simply as

$$\operatorname{Var}(x) = x^\prime(V\Lambda V^\prime) x = \sum_{i=1}^p \lambda_{ii} (V^\prime x)_i^2.$$

In other words, the variance of $x$ is found as the sum of $p$ terms, each obtained by

(a) transforming to $y=V^\prime x,$ then (b) squaring each coefficient $y_i,$ and (c) multiplying the square by $\lambda_{ii}$.

This enables us to understand the action of $C$ in simple terms: $y$ is just another way of expressing $x$ (it uses the rows of $V^\prime,$ i.e. the eigenvectors, as a basis) and its terms contribute their squares to the variance, weighted by $\lambda_{ii}.$
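(The same decomposition can be verified term by term in a short numpy sketch, under the same synthetic setup as above: transform to $y = V^\prime x$ and sum the eigenvalue-weighted squares.)

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
C = X.T @ X / (n - 1)
lam, V = np.linalg.eigh(C)          # C = V diag(lam) V'

x = rng.normal(size=p)
y = V.T @ x                         # (a) coordinates of x in the eigenvector basis

# (b), (c): eigenvalue-weighted squared coordinates sum to the variance x' C x
assert np.isclose(np.sum(lam * y**2), x @ C @ x)
```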

The relationship to PCA is the following. It makes little sense to maximize the variance, because by scaling $x$ we can make the variance arbitrarily large. But if we think of $x$ solely as determining a linear subspace (if you like, an unsigned direction), we may represent that direction by scaling $x$ to have unit length. Thus, assume $||x||^2=1.$ Because $V$ is an orthogonal matrix, $y$ also has unit length:

$$||y||^2 = y^\prime y = (V^\prime x)^\prime(V^\prime x) = x^\prime(VV^\prime) x = x^\prime \mathbb{I}x = ||x||^2= 1.$$

To make the variance of $x$ as large as possible, you want to put as much weight as possible on the largest eigenvalue (the largest $\lambda_{ii}$). Without any loss of generality you can arrange the columns of $V$ (and, correspondingly, the diagonal entries of $\Lambda$) so that this is $\lambda_{11}.$ A variance-maximizing vector therefore is $y^{(1)} = (1,0,\ldots,0)^\prime.$ The corresponding $x$ is

$$x^{(1)} = V y^{(1)},$$

the first column of $V.$ This is the first principal component. Its variance is $\lambda_{11}.$ By construction, it is a unit vector with the largest possible variance. It represents a linear subspace.
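(A numerical illustration of this maximization, as a sketch with synthetic data: after sorting the eigenvalues in decreasing order, the first column of $V$ attains variance $\lambda_{11},$ and no random unit direction exceeds it.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)
C = X.T @ X / (n - 1)

lam, V = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]                 # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

x1 = V[:, 0]                                  # first principal component (unit vector)
assert np.isclose(x1 @ C @ x1, lam[0])        # its variance is the largest eigenvalue

# Random unit directions never achieve a larger variance
dirs = rng.normal(size=(p, 10_000))
dirs = dirs / np.linalg.norm(dirs, axis=0)
variances = np.einsum('ij,ij->j', dirs, C @ dirs)
assert np.all(variances <= lam[0] + 1e-8)
```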

The rest of the principal components are obtained similarly from the other columns of $V$ because (by definition) those columns are mutually orthogonal.

When all the $\lambda_{ii}$ are distinct, this method gives a unique set of solutions:

The principal components of $C$ are the linear subspaces corresponding to the columns of $V.$ The variance of column $i$ is $\lambda_{ii}.$

More generally, there may be infinitely many ways to diagonalize $C$ (this happens when there are one or more eigenspaces of dimension greater than $1,$ so-called "degenerate" eigenspaces). The columns of any particular such $V$ still enjoy the foregoing properties. $V$ is usually chosen so that $\lambda_{11}\ge\lambda_{22}\ge\cdots\ge\lambda_{pp},$ which lists the principal components in order of decreasing variance.
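(Putting the pieces together in one last numpy sketch, under the same synthetic setup: after ordering the eigenvalues as $\lambda_{11}\ge\cdots\ge\lambda_{pp},$ projecting the centered data onto the columns of $V$ yields scores whose sample variances are exactly those eigenvalues.)

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)
C = X.T @ X / (n - 1)

lam, V = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]         # enforce lambda_11 >= lambda_22 >= ... >= lambda_pp
lam, V = lam[order], V[:, order]

scores = X @ V                        # principal component scores, one column per PC
assert np.allclose(scores.var(axis=0, ddof=1), lam)   # variance of PC i is lambda_ii
```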
