Solved – Calculate the variance explained in matrix Y by matrix X

multivariate analysis, partial least squares, pca, regression, variance

I have two matrices corresponding to the same set of $n$ samples, with $j$ and $k$ variables, respectively ($j > 10000$, $k > 10000$). $X$ is an $n \times j$ matrix and $Y$ is an $n \times k$ matrix.

I want to know what proportion of the overall variance of $Y$ can be explained by $X$. The ultimate goal is to compare $X$ to a similar matrix $X'$ and see which one explains $Y$ better, and how much the variance explained by the two matrices overlaps.

My first approach was to try PLS; however, I don't understand this method well enough to know what I'm doing, and in any case the implementation I tested (the pls package in R) exited with an error indicating a failure to allocate half a terabyte of memory.

Furthermore, I was considering the following approach (possibly this is re-inventing the wheel; it might also be completely wrong). PCA can partition the variance of both $X$ and $Y$ into orthogonal components whose variance fractions sum to 1. Therefore, I could calculate, for the first component of $PCA(Y)$, what percentage of its variance is explained by $X$; in R code, I'd try

set.seed(2708)
data(iris)
Y    <- iris[, 1:4]
# X is a noisy copy of Y, so it should explain Y well
X    <- apply(Y, 2, function(x) x + rnorm(length(x), sd = 0.5))
pcaX <- prcomp(X, scale. = TRUE)
pcaY <- prcomp(Y, scale. = TRUE)

# regress the first PC of Y on all PCs of X
lm.m <- lm(pcaY$x[, 1] ~ pcaX$x)
print(summary(lm.m)$r.squared)
# [1] 0.9319846

Then I could sum these values over all components, each multiplied by the fraction of the variance of $Y$ that the component holds:

Yvars <- pcaY$sdev^2/sum(pcaY$sdev^2)
foo   <- apply(pcaY$x, 2, function(y) summary(lm(y ~ pcaX$x))$r.squared)
print(sum(foo * Yvars))
# [1] 0.7993989

# X2 is a much noisier copy of Y, so it should explain much less
X2    <- apply(Y, 2, function(x) x + rnorm(length(x), sd = 5))
pcaX2 <- prcomp(X2, scale. = TRUE)
foo2  <- apply(pcaY$x, 2, function(y) summary(lm(y ~ pcaX2$x))$r.squared)
print(sum(foo2 * Yvars))
# [1] 0.1344896

Does this make ANY sense? Computationally, it works as intended, i.e. the $X$es that are less predictive of $Y$ get a lower value.

However, it (obviously) does not work when $n < j$ (I thought one might then choose the first $m < j$ components that capture 99% of the variance).

Finally, I would like to have a test or criterion to compare $X$ and $X'$.

Best Answer

Yes, you did re-invent the wheel.

What proportion of the overall variance of $Y$ can be explained by $X$

This question is answered by the $R^2$ of a linear regression; it's just that in this case you have to deal with multivariate regression. Still, you can regress your whole matrix $\mathbf Y$ on $\mathbf X$ (both centered) by finding the matrix $\mathbf B$ minimizing $\|\mathbf Y-\mathbf X\mathbf B\|$; the proportion of explained variance is then given by $$R^2 = 1-\frac{\|\mathbf Y-\mathbf X\mathbf B\|^2}{\|\mathbf Y\|^2}.$$ It is probably a one-liner in R using lm.
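For concreteness, here is a minimal sketch using the toy X and Y from the question (lm accepts a matrix response, so the multivariate fit really is a one-liner):

# Direct multivariate R^2: regress the whole (centered) Y on the centered X
Yc  <- scale(as.matrix(Y), center = TRUE, scale = FALSE)
Xc  <- scale(as.matrix(X), center = TRUE, scale = FALSE)
fit <- lm(Yc ~ Xc)                        # fits every column of Yc at once
R2  <- 1 - sum(residuals(fit)^2) / sum(Yc^2)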

What you did is a complicated way to compute the same quantity. You can regress each variable in $\mathbf Y$ on all variables in $\mathbf X$, take $R^2$ multiplied by the fraction of variance of $\mathbf Y$ that this variable captures, and sum over all variables. You can replace $\mathbf X$ by its principal components. You can replace $\mathbf Y$ by its principal components too. In each case you will get the same number.
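You can check this numerically on the question's own iris example; since prcomp was called with scale. = TRUE there, the matching direct computation uses the scaled $\mathbf Y$ (foo and Yvars are the objects computed in the question):

Ys <- scale(as.matrix(Y))     # centered and scaled, as in prcomp(Y, scale. = TRUE)
R2.direct <- 1 - sum(residuals(lm(Ys ~ pcaX$x))^2) / sum(Ys^2)
R2.pca    <- sum(foo * Yvars) # the question's weighted sum
all.equal(R2.direct, R2.pca)  # TRUE (up to numerical tolerance)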

Finally, regarding the situation when there are fewer data points than dimensions in $\mathbf X$: in this case $R^2 = 1$, because any $\mathbf Y$ can be explained by any $\mathbf X$; you can find $\mathbf B$ such that $\mathbf Y=\mathbf X\mathbf B$ holds exactly.
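A quick illustration of this degenerate case with made-up numbers ($n = 20$ samples, $j = 50$ predictors; the response is pure noise, unrelated to the predictors):

set.seed(1)
n  <- 20; j <- 50
Xb <- matrix(rnorm(n * j), n, j)    # more predictors than samples
Yb <- matrix(rnorm(n * 3), n, 3)    # a response unrelated to Xb
Yc <- scale(Yb, center = TRUE, scale = FALSE)
fit <- lm(Yc ~ Xb)                  # lm drops redundant columns, but the
                                    # intercept plus n - 1 of them already
                                    # interpolate the data exactly
1 - sum(residuals(fit)^2) / sum(Yc^2)   # 1, up to rounding error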
