Solved – Dimensionality reduction (SVD or PCA) on a large, sparse matrix

dimensionality reduction, matrix decomposition, pca, r, svd

/edit: Further follow-up: nowadays you can use irlba::prcomp_irlba
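A minimal sketch of how that might look, using the sparse matrix M constructed further down (centering is handled internally by prcomp_irlba, so M itself is never densified):

library(irlba)

# Approximate PCA on the sparse matrix M defined below; n is the number of
# components to estimate. center=TRUE is the default, so no manual centering
# (and no loss of sparsity) is needed.
pc <- prcomp_irlba(M, n = 5)
head(pc$x)    # principal component scores
pc$sdev       # standard deviations of the components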


/edit: Following up on my own post: irlba now has "center" and "scale" arguments, which let you use it to calculate principal components, e.g.:

pc <- M %*% irlba(M, nv=5, nu=0, center=colMeans(M), right_only=TRUE)$v


I have a large, sparse Matrix of features I would like to use in a machine learning algorithm:

library(Matrix)
set.seed(42)
rows <- 500000
cols <- 10000
i <- unlist(lapply(1:rows, function(i) rep(i, sample(1:5,1))))
j <- sample(1:cols, length(i), replace=TRUE)
M <- sparseMatrix(i, j)

Because this matrix has many columns, I would like to reduce its dimensionality to something more manageable. I can use the excellent irlba package to perform SVD and return the first n principal components (5 shown here; I'll probably use 100 or 500 on my actual dataset):

library(irlba)
pc <- irlba(M, nu=5)$u

However, I've read that prior to performing PCA, one should center the matrix (subtract the column mean from each column). This is very difficult to do on my dataset, and furthermore would destroy the sparsity of the matrix.

How "bad" is it to perform SVD on the un-scaled data, and feed it straight into a machine learning algorithm? Are there any efficient ways I could scale this data, while preserving the sparsity of the matrix?


/edit: As brought to my attention by B_miner, the "PCs" should really be:

pc <- M %*% irlba(M, nv=5, nu=0)$v 

Also, I think whuber's answer should be pretty easy to implement, via the crossprod function, which is extremely fast on sparse matrices:

system.time(M_Mt <- crossprod(M))  # 0.463 seconds
system.time(means <- colMeans(M))  # 0.003 seconds

Now I'm not quite sure what to do with the means vector before subtracting it from M_Mt, but will post as soon as I figure it out (see the sketch below and /edit3 for what eventually worked).
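For reference, a minimal sketch of that adjustment, using whuber's identity from the answer below (M, M_Mt, and means are the objects defined above; n_obs is just a local name introduced here):

n_obs <- nrow(M)

# Rank-one correction: subtract n_obs * means %*% t(means) from the raw
# cross-product matrix, then rescale to obtain the sample covariance matrix.
# Note: the corrected matrix is dense, but only cols x cols.
M_Mt_centered <- (M_Mt - n_obs * tcrossprod(means)) / (n_obs - 1)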


/edit3: Here's a modified version of whuber's code, using sparse matrix operations for each step of the process. If you can store the entire sparse matrix in memory, it works very quickly:

library('Matrix')
library('irlba')
set.seed(42)
m <- 500000
n <- 100
i <- unlist(lapply(1:m, function(i) rep(i, sample(25:50,1))))
j <- sample(1:n, length(i), replace=TRUE)
x <- sparseMatrix(i, j, x=runif(length(i)))

n_comp <- 50
system.time({
  xt.x <- crossprod(x)                              # t(x) %*% x, stays sparse
  x.means <- colMeans(x)
  xt.x <- (xt.x - m * tcrossprod(x.means)) / (m-1)  # rank-one mean correction -> sample covariance
  svd.0 <- irlba(xt.x, nu=0, nv=n_comp, tol=1e-10)  # truncated SVD of the covariance matrix
})
#user  system elapsed 
#0.148   0.030   2.923 

system.time(pca <- prcomp(x, center=TRUE))
#user  system elapsed 
#32.178   2.702  12.322

max(abs(pca$center - x.means))
max(abs(xt.x - cov(as.matrix(x))))
max(abs(abs(svd.0$v / pca$rotation[,1:n_comp]) - 1))

If you set the number of columns to 10,000 and the number of principal components to 25, the irlba-based PCA takes about 17 minutes to calculate 50 approximate principal components and consumes about 6GB of RAM, which isn't too bad.
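A follow-up sketch (my addition, not part of the timing above): once svd.0 is available, you can project onto the estimated components without densifying x, since centering only shifts each score column by a constant equal to the projection of the column means:

V <- svd.0$v

# Sparse-times-dense multiply; the result is dense but only m x n_comp.
scores <- as.matrix(x %*% V)

# Finish the centering by subtracting the projection of the column means
# from every row.
scores <- sweep(scores, 2, as.vector(x.means %*% V), "-")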

Best Answer

First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.
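A small self-contained illustration of this point (my own toy example, not part of the original answer): when every column shares a large mean, the leading right singular vector of the uncentered matrix is almost parallel to the vector of column means.

set.seed(1)
A  <- matrix(rnorm(1000 * 20, mean = 5), 1000, 20)  # strong common mean in every column
v1 <- svd(A, nu = 0, nv = 1)$v[, 1]                 # leading right singular vector, no centering
mu <- colMeans(A)
abs(sum(v1 * mu)) / sqrt(sum(mu^2))                 # cosine similarity with the mean vector: ~1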

So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X'X$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million dot products between pairs of columns.

Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,

$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\mathbf{1}\cdot Y - m_Y\mathbf{1}\cdot Z + m_Z m_Y \mathbf{1}\cdot \mathbf{1}\\ = Y\cdot Z - n\, m_Y m_Z,$$

because $m_Y = \mathbf{1}\cdot Y / n$ and $m_Z = \mathbf{1}\cdot Z/n$.
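A quick numerical check of this identity in R (a throwaway sketch with arbitrary test vectors):

set.seed(7)
n   <- 1000
Y   <- rnorm(n); Z <- rnorm(n)
lhs <- sum((Y - mean(Y)) * (Z - mean(Z)))  # centered dot product
rhs <- sum(Y * Z) - n * mean(Y) * mean(Z)  # raw dot product plus the correction
all.equal(lhs, rhs)                        # TRUE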

This allows you to use sparse matrix techniques to compute $X'X$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X'X$ will be very sparse.


Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Number of rows (observations)
n <- 100    # Number of columns here; will be 10,000 in the real problem
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(NA_real_, n, n)  # n x n accumulator for the cross-products
  x.means <- numeric(n)           # column means
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)  # the extra 1/m only rescales the singular values, not the vectors
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(as.matrix(x))))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)
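
One more check along the same lines (my addition, assuming the objects above): since svd.0 was computed on xt.x / m, its singular values are the component variances divided by m, so they can be related back to prcomp's standard deviations.

max(abs(sqrt(svd.0$d * m) - pca$sdev))  # should also be essentially zero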