First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.
So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X'X$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million calculations of dot products of one column with another.
Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,
$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\,\mathbf{1}\cdot Y - m_Y\,\mathbf{1}\cdot Z + m_Y m_Z\,\mathbf{1}\cdot\mathbf{1} \\ = Y\cdot Z - n\, m_Y m_Z,$$
because $m_Y = \mathbf{1}\cdot Y / n$, $m_Z = \mathbf{1}\cdot Z/n$, and $\mathbf{1}\cdot \mathbf{1} = n$.
This allows you to use sparse matrix techniques to compute $X'X$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X'X$ will be very sparse.
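For instance, the identity is easy to verify numerically; this little snippet is purely illustrative and uses made-up vectors of the same sparse-ish form as the example below:

set.seed(17)
Y <- pmax(0, rnorm(1000, mean=-2))   # sparse-ish test vectors, like the columns of X
Z <- pmax(0, rnorm(1000, mean=-2))
sum((Y - mean(Y)) * (Z - mean(Z)))                # explicit centering
sum(Y * Z) - length(Y) * mean(Y) * mean(Z)        # centering via the identity: same value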
Example
The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction, and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing the SVD on a 10K by 10K matrix!
m <- 500000   # Number of rows, already at the full 500,000
n <- 100      # Number of columns; will be 10,000 in the full problem
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(0, n, n)    # Will hold the column cross products
  x.means <- numeric(n)      # Will hold the column means
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i, i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i, j] <- xt.x[j, i] <- sum(j.col * i.col)
      }
    }
  }
  # Apply the centering adjustment to obtain the sample covariance matrix.
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)     # The 1/m rescaling affects only the singular values, not the vectors
})
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)
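As a side note (this is only a sketch, not part of the timing comparison above), the principal component scores can also be obtained from svd.0 without ever forming a dense centered copy of x, because $(X - \mathbf{1}m')V = XV - \mathbf{1}(m'V)$, where $m$ is the vector of column means:

# Sketch: scores on the first three PCs without densifying x,
# using (X - 1 m') V = X V - 1 (m' V).
v <- svd.0$v[, 1:3]                  # leading principal axes from the sparse computation
scores <- sweep(as.matrix(x %*% v), 2, drop(crossprod(x.means, v)), "-")
max(abs(abs(scores / pca$x[, 1:3]) - 1))   # essentially zero, up to column signs (also an unstable check)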
Best Answer
Whether sparse PCA is easier to interpret than standard PCA depends on the dataset you are investigating. Here is how I think about it: sometimes one is more interested in the PCA projections (a low-dimensional representation of the data), and sometimes in the principal axes themselves; it is only in the latter case that sparse PCA can offer any benefit for interpretation. Let me give a couple of examples.
I, for example, work with neural data (simultaneous recordings of many neurons) and apply PCA and/or related dimensionality reduction techniques to get a low-dimensional representation of neural population activity. I might have 1000 neurons (i.e. my data live in a 1000-dimensional space) and want to project them onto the three leading principal axes. What these axes are is totally irrelevant to me, and I have no intention of "interpreting" them in any way. What I am interested in is the 3D projection (as the activity depends on time, I get a trajectory in this 3D space). So I am fine if each axis has all 1000 coefficients non-zero.
On the other hand, somebody might be working with more "tangible" data, where individual dimensions have obvious meanings (unlike the individual neurons above): e.g. a dataset of various cars, where the dimensions are anything from weight to price. In this case one might actually be interested in the leading principal axes themselves, because one might want to say something like: look, the 1st principal axis corresponds to the "fanciness" of the car (I am totally making this up now). If the axis is sparse, such interpretations are generally easier to give, because many variables have $0$ coefficients and so are obviously irrelevant for that particular axis. With standard PCA, one usually gets non-zero coefficients for all variables.
You can find more examples and some discussion of the latter case in the 2006 Sparse PCA paper by Zou et al. The difference between the former and the latter case, however, is something I have not seen explicitly discussed anywhere (even though it probably has been).
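To make the contrast concrete, here is a small sketch using the elasticnet package, which, as far as I know, implements the method of the Zou et al. paper and ships the pitprops correlation matrix used as an example there; the penalty values below are simply borrowed from the package's documentation rather than tuned, so treat the whole thing as illustrative only:

library(elasticnet)                 # sparse PCA of Zou, Hastie & Tibshirani (2006)
data(pitprops)                      # 13 x 13 correlation matrix used in that paper
sp <- spca(pitprops, K = 3, type = "Gram", sparse = "penalty",
           para = c(0.06, 0.16, 0.1))
sp$loadings                         # many coefficients are exactly zero
eigen(pitprops)$vectors[, 1:3]      # standard PC loadings: all coefficients non-zero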