Solved – How to get “eigenvalues” (percentages of explained variance) of vectors that are not PCA eigenvectors

linear algebra, pca, r, variance

I would like to understand how I can get the percentage of variance of a data set, not in the coordinate space provided by PCA, but against a slightly different set of (rotated) vectors.

[Figure: scatter plot of the simulated data with the PCA eigenvectors in red and the rotated axes in green, as produced by the code below]

set.seed(1234)
xx <- rnorm(1000)
yy <- xx * 0.5 + rnorm(1000, sd = 0.6)
vecs <- cbind(xx, yy)                        # 1000 x 2 data matrix
plot(vecs, xlim = c(-4, 4), ylim = c(-4, 4))
vv <- eigen(cov(vecs))$vectors               # PCA eigenvectors (columns)
ee <- eigen(cov(vecs))$values                # PCA eigenvalues (variances along the PCs)
a1 <- vv[, 1]
a2 <- vv[, 2]
theta <- pi/10
rotmat <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # 2D rotation by theta
a1r <- a1 %*% rotmat                         # rotated axes
a2r <- a2 %*% rotmat
arrows(0, 0, a1[1], a1[2], lwd = 2, col = "red")
arrows(0, 0, a2[1], a2[2], lwd = 2, col = "red")
arrows(0, 0, a1r[1], a1r[2], lwd = 2, col = "green3")
arrows(0, 0, a2r[1], a2r[2], lwd = 2, col = "green3")
legend("topleft", legend = c("eigenvectors", "rotated"), fill = c("red", "green3"))

So basically I know that the variance of the dataset along each of the red axes given by PCA is represented by the eigenvalues. But how could I get the equivalent variances, totalling the same amount, but projected onto the two different axes in green, which are a rotation by pi/10 of the principal component axes? That is, given two orthogonal unit vectors from the origin, how can I get the variance of a dataset along each of these arbitrary (but orthogonal) axes, such that all the variance is accounted for (i.e. the "eigenvalues" sum to the same total as those of PCA)?

Best Answer

If the vectors are orthogonal, you can just take the variance of the scalar projection of the data onto each vector. Say we have a data matrix $X$ ($n$ points $\times$ $d$ dimensions), and a set of orthonormal column vectors $\{v_1, \ldots, v_k\}$. Assume the data are centered. The variance of the data along the direction of each vector $v_i$ is given by $\text{Var}(X v_i)$.
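A minimal sketch in R, continuing from the question's code above (so vecs, a1r and a2r are assumed to be defined already; the names Xc, Vr, proj and var_rot are just ad hoc for illustration):

Xc <- scale(vecs, center = TRUE, scale = FALSE)   # centre the data (column means -> 0)
Vr <- cbind(c(a1r), c(a2r))                       # rotated unit vectors as orthonormal columns
proj <- Xc %*% Vr                                 # scalar projections onto each rotated axis
var_rot <- apply(proj, 2, var)                    # variance of the data along each rotated axis
var_rot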

If there are as many vectors as original dimensions ($k = d$), the sum of the variances of the projections will equal the sum of the variances along the original dimensions. But, if there are fewer vectors than original dimensions ($k < d$), the sum of variances will generally be less than for PCA. One way to think of PCA is that it maximizes this very quantity (subject to the constraint that the vectors are orthogonal).
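Continuing the sketch, one can check that with as many rotated axes as dimensions the projection variances add up to the same total as the PCA eigenvalues, while a single rotated axis captures less than the first principal component:

sum(var_rot)              # total variance along both rotated axes
sum(ee)                   # sum of the PCA eigenvalues -- the same total
sum(apply(Xc, 2, var))    # sum of variances along the original dimensions -- also the same
c(var_rot[1], ee[1])      # one rotated axis alone captures less than the first PC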

You may also want to calculate $R^2$ (the fraction of variance explained), which is often used to measure how well a given number of PCA dimensions represent the data. Let $S$ represent the sum of the variances along each original dimension of the data. Then:

$$R^2 = \frac{1}{S}\sum_{i=1}^{k} \text{Var}(X v_i)$$

This is just the ratio of the summed variances of the projections to the summed variances along the original dimensions.
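For example, keeping only the first rotated axis (so $k < d$) in the running sketch, this could be computed as follows (S, R2_rot1 and R2_pc1 are again ad hoc names):

S <- sum(apply(Xc, 2, var))   # total variance along the original dimensions
R2_rot1 <- var_rot[1] / S     # R^2 using only the first rotated axis
R2_pc1  <- ee[1] / S          # R^2 using only the first principal component
c(R2_rot1, R2_pc1)            # the PC axis explains at least as much variance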

Another way to think about $R^2$ is that it measures the goodness of fit if we try to reconstruct the data from the projections. It then takes the familiar form used for other models (e.g. regression). Say the $i$th data point is a row vector $x_{(i)}$. Store the basis vectors in the columns of a matrix $V$. The projection of the $i$th data point onto all vectors in $V$ is given by $p_{(i)} = x_{(i)} V$. When there are fewer vectors than original dimensions ($k < d$), we can think of this as mapping the data linearly into a space with reduced dimensionality. We can approximately reconstruct the data point from the low-dimensional representation by mapping back into the original data space: $\hat{x}_{(i)} = p_{(i)} V^T$. The mean squared reconstruction error is the mean squared Euclidean distance between each original data point and its reconstruction:
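A sketch of this reconstruction in the running example, keeping only the first rotated axis (so $k = 1 < d$); V1 below is simply the first column of the Vr matrix built earlier:

V1   <- Vr[, 1, drop = FALSE]   # d x k matrix of basis vectors (here 2 x 1)
P    <- Xc %*% V1               # low-dimensional representation, one p_(i) per row
Xhat <- P %*% t(V1)             # approximate reconstruction back in the original space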

$$E = \frac{1}{n} \sum_{i=1}^{n} \|x_{(i)} - \hat{x}_{(i)}\|^2$$

The goodness of fit $R^2$ is defined the same way as for other models (i.e. as one minus the fraction of unexplained variance). Given the mean squared error of the model ($\text{MSE}$) and the total variance of the modeled quantity ($\text{Var}_{\text{total}}$), $R^2 = 1 - \text{MSE} / \text{Var}_{\text{total}}$. In the context of our data reconstruction, the mean squared error is $E$ (the reconstruction error). The total variance is $S$ (the sum of variances along each dimension of the data). So:

$$R^2 = 1 - \frac{E}{S}$$
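Numerically (continuing the sketch, and computing $E$ with the same $n - 1$ denominator that R's var() uses for $S$, so the two conventions match), the two expressions agree:

n <- nrow(Xc)
E <- sum(rowSums((Xc - Xhat)^2)) / (n - 1)   # squared reconstruction error, n - 1 denominator
R2_recon <- 1 - E / S
c(R2_recon, R2_rot1)                         # same value as the projection-based R^2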

$S$ is also equal to the mean squared Euclidean distance from each data point to the mean of all data points, so we can also think of $R^2$ as comparing the reconstruction error to that of the 'worst-case model' that always returns the mean as the reconstruction.
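This too can be checked in the running example; note the equality is exact under the $1/n$ convention for the variance (with R's default $n - 1$ denominator the two quantities differ by a factor of $n/(n-1)$):

mean(rowSums(Xc^2))   # mean squared Euclidean distance from each point to the mean
sum(colMeans(Xc^2))   # sum of per-dimension variances (1/n convention) -- the same number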

The two expressions for $R^2$ are equivalent. As above, if there are as many vectors as original dimensions ($k = d$) then $R^2$ will be one. But, if $k < d$, $R^2$ will generally be less than for PCA. Another way to think about PCA is that it minimizes the squared reconstruction error.
