I'd like to see a benchmark of their sparse implementation against a modern CART implementations used in rf's. That paper is quite old in terms of advances in this area and I would be surprised if it still provided significant speed up.
Part of the reason is that using a clever sorting algorithm like Quicksort in split searching can provide near O(n) performance for near constant features (including sparse ones). Fast implementations also track when a feature has become constant within a branch of a tree and should no longer be examined. Dense feature representations provide fast look ups in a cpu cache friendly fashion so you'd need a really clever sparse representation to win out in cpu cycles.
This is discussed here, here, here.
I actually implemented a sparse data representation of data at one point in my rf package CloudForest but found it to be slower then a dense representation of the data and abandoned it though it did provide some memory advantages.
My recommendation would be to try scikit learn or cloudforest built in boosting stuff and see if it is fast enough. Both can be extended with custom boosting criteria if you want to do something non standard. (I actually wrote cloudforest originally to work with large, highly dimensional genetic data sets which are very similar to what you are describing).
First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.
So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X X'$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million calculations of dot products of one column with the next.
Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,
$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\mathbf{1}\cdot Y - m_Y\mathbf{1}.Z + m_Z m_Y \mathbf{1}\cdot \mathbf{1}\\ = Y\cdot Z -n (m_Ym_Z),$$
because $m_Y = \mathbf{1}\cdot Y / n$ and $m_Z = \mathbf{1}\cdot Z/n$.
This allows you to use sparse matrix techniques to compute $X X'$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X X'$ will be very sparse.
Example
The following R
code demonstrates this approach. It uses a stub, get.col
, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp
. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!
m <- 500000 # Will be 500,000
n <- 100 # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
xt.x <- matrix(numeric(), n, n)
x.means <- rep(numeric(), n)
for (i in 1:n) {
i.col <- get.col(i)
x.means[i] <- mean(i.col)
xt.x[i,i] <- sum(i.col * i.col)
if (i < n) {
for (j in (i+1):n) {
j.col <- get.col(j)
xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
}
}
}
xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)
Best Answer
A possible solution is to use the standard sparse PCA algorithm and increase the ridge penalty coefficient.
There are probably better solutions but this is what I did.