Dimensionality Reduction – Alternatives to PCA for Large Datasets

Tags: dimensionality reduction, high-dimensional, java, pca, tsne

Problem Setup

I have data points (images) of high dimension (4096) that I'm trying to visualize in 2D. To this end, I'm using t-SNE in a manner similar to the example code by Karpathy.

The scikit-learn documentation recommends using PCA to first lower the dimension of the data:

It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high.

I'm using this code by Darks.Liu to perform PCA in Java:

// Uses jblas (org.jblas.DoubleMatrix, ComplexDoubleMatrix, Eigen)
// plus java.util.{List, ArrayList, Collections}.

// Covariance matrix: C = X * X^T / m  (the columns of `source` are the m samples)
DoubleMatrix covMatrix = source.mmul(source.transpose()).div(source.columns);
// Full eigendecomposition of the covariance matrix (this is the bottleneck)
ComplexDoubleMatrix eigVal = Eigen.eigenvalues(covMatrix);
ComplexDoubleMatrix[] eigVectorsVal = Eigen.eigenvectors(covMatrix);
ComplexDoubleMatrix eigVectors = eigVectorsVal[0];
// Sort eigenvectors from largest to smallest eigenvalue
List<PCABean> beans = new ArrayList<PCABean>();
for (int i = 0; i < eigVectors.columns; i++) {
    beans.add(new PCABean(eigVal.get(i).real(), eigVectors.getColumn(i)));
}
Collections.sort(beans);
// Stack the top `dimension` eigenvectors as rows of the projection matrix
DoubleMatrix newVec = new DoubleMatrix(dimension, beans.get(0).vector.rows);
for (int i = 0; i < dimension; i++) {
    ComplexDoubleMatrix dm = beans.get(i).vector;
    DoubleMatrix real = dm.getReal();
    newVec.putRow(i, real);
}
// Project the data onto the top principal components
return newVec.mmul(source);

It uses jblas for the linear algebra operations, which, from what I've read, is supposed to be the fastest option out there. However, computing the eigenvalues and eigenvectors (the `Eigen.eigenvalues` and `Eigen.eigenvectors` calls) turns out to be a huge bottleneck (~10 minutes, which is much longer than I can afford for this stage).

I've read about kernel PCA, which is supposed to handle very high-dimensional data well, but its runtime is $O(n^3)$ in the number of examples, which could be problematic since I also want to handle cases where both the dimension and the number of examples are large.

As I see it, my options are either to "optimize" PCA or to opt for another dimensionality reduction method which is inherently faster.

My Questions

  1. Is there any hope that PCA can be used in an "offline" fashion? That is, perform PCA once on a large data set of images, and then use the principal components computed from it to reduce the dimension of other (new!) data points?
  2. Can I speed up the eigenvector calculation, assuming I know ahead of time that I'm only interested in, say, the top 100 principal components?
  3. Is there an alternative dimensionality reduction method that is appropriate in my case (i.e., before applying t-SNE) and faster than PCA? I'm looking for something that can be implemented easily in Java.

Best Answer

Question 1: Let's say you have observed a data matrix $X \in \mathbb R^{n \times p}$. From this you can compute the eigendecomposition $X^T X = Q \Lambda Q^T$. The question now is: if we get new data coming from the same population, perhaps collected into a matrix $Z \in \mathbb R^{m \times p}$, will $ZQ$ be close to the ideal orthogonal rotation of $Z$? This kind of question is addressed by the Davis-Kahan theorem, and matrix perturbation theory in general (if you can get ahold of a copy, Stewart and Sun's 1990 textbook is the standard reference).
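For illustration, here is a rough jblas sketch of that "offline" use (the class and method names are mine, not from any library): fit the eigenvectors $Q$ once on a training matrix $X$ (rows = samples), store them, and later reduce new data $Z$ by computing $ZQ$.

import org.jblas.DoubleMatrix;
import org.jblas.Eigen;

// Hypothetical helper sketching "offline" PCA with jblas: fit the projection once
// on a training matrix X (n x p, rows = samples), then reuse it for new data Z
// drawn from the same population.
public class OfflinePca {

    private final DoubleMatrix components; // p x k, columns = top eigenvectors of X^T X

    public OfflinePca(DoubleMatrix X, int k) {
        // Scatter matrix X^T X (p x p); center X beforehand if you want a true covariance.
        DoubleMatrix scatter = X.transpose().mmul(X);
        // Symmetric eigendecomposition: eig[0] = eigenvectors (columns), eig[1] = diag(eigenvalues).
        DoubleMatrix[] eig = Eigen.symmetricEigenvectors(scatter);
        DoubleMatrix vectors = eig[0];
        DoubleMatrix values = eig[1].diag();
        components = new DoubleMatrix(X.columns, k);
        for (int j = 0; j < k; j++) {
            int top = values.argmax();                  // largest remaining eigenvalue
            components.putColumn(j, vectors.getColumn(top));
            values.put(top, Double.NEGATIVE_INFINITY);  // exclude it from the next pass
        }
    }

    // Project any data matrix (m x p) onto the stored components: result is m x k.
    public DoubleMatrix transform(DoubleMatrix Z) {
        return Z.mmul(components);
    }
}

So something like new OfflinePca(trainingImages, 50).transform(newImages) gives you 50-dimensional inputs for t-SNE without redoing the eigendecomposition; how faithful that projection is for the new points is exactly what the perturbation results above quantify.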

Question 2: you definitely can speed things up if you know you only need the top $k$ eigenvectors. In R I use rARPACK for this; I'm sure there's a Java equivalent, since they're all Fortran wrappers anyway.
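I can't point to a specific jblas-compatible ARPACK wrapper, but the idea behind those solvers can be sketched in plain jblas as orthogonal (subspace) iteration, which only ever multiplies by the matrix and never forms the full decomposition. This is my own illustrative sketch (class and method names are made up), and a real Lanczos/ARPACK implementation will converge much faster:

import org.jblas.DoubleMatrix;

// Orthogonal (subspace) iteration: approximates the top-k eigenvectors of a
// symmetric matrix (e.g. a covariance matrix) without a full eigendecomposition.
public class TopKEigen {

    // cov: symmetric p x p matrix, k: number of leading eigenvectors, iters: iterations to run.
    public static DoubleMatrix topEigenvectors(DoubleMatrix cov, int k, int iters) {
        DoubleMatrix Q = orthonormalize(DoubleMatrix.randn(cov.rows, k)); // random start
        for (int t = 0; t < iters; t++) {
            Q = orthonormalize(cov.mmul(Q));   // multiply by cov, then re-orthonormalize
        }
        return Q;                              // p x k, columns approximate the top eigenvectors
    }

    // Modified Gram-Schmidt orthonormalization of the columns of V.
    private static DoubleMatrix orthonormalize(DoubleMatrix V) {
        DoubleMatrix Q = V.dup();
        for (int j = 0; j < Q.columns; j++) {
            DoubleMatrix v = Q.getColumn(j);
            for (int i = 0; i < j; i++) {
                DoubleMatrix q = Q.getColumn(i);
                v = v.sub(q.mul(q.dot(v)));    // remove the component along q
            }
            Q.putColumn(j, v.div(v.norm2()));  // normalize to unit length
        }
        return Q;
    }
}

For a 4096 x 4096 covariance matrix and k = 100, each iteration costs a few 4096 x 100 multiplications instead of a full dense eigendecomposition, which is where the speedup comes from.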

Question 3: I don't know anything about Java implementations, but this thread discusses speeding up PCA, as does this CV thread. There's a ton of research on this sort of thing, and many of the methods rely on low-rank approximations or randomization.
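One concrete randomized option that is trivial to implement in Java is Gaussian random projection: multiply the data by a random Gaussian matrix to go from 4096 down to ~50 dimensions in a single matrix multiplication, and let t-SNE do the rest. A minimal sketch with jblas (the names here are mine, purely for illustration):

import org.jblas.DoubleMatrix;

// Gaussian random projection: one of the simplest randomized alternatives to PCA.
public class RandomProjection {

    // data: n x p (rows = samples), k: target dimension (e.g. 50 before t-SNE).
    public static DoubleMatrix project(DoubleMatrix data, int k) {
        // Random Gaussian matrix scaled by 1/sqrt(k) so pairwise distances are
        // roughly preserved (Johnson-Lindenstrauss flavor of guarantee).
        DoubleMatrix R = DoubleMatrix.randn(data.columns, k).div(Math.sqrt(k));
        return data.mmul(R); // n x k
    }
}

It is cruder than PCA (the directions are random rather than variance-maximizing), but as a preprocessing step before t-SNE it is often good enough, and its cost is a single n x p times p x k multiplication.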