PCA in Python – How to Perform PCA on Data with High Dimensionality

pca python

To perform principal component analysis (PCA), you subtract the mean of each column from the data, compute the correlation coefficient matrix, and then find its eigenvectors and eigenvalues. At least, that is how I implemented it in Python, but it only works for small matrices because the function I use to build the correlation coefficient matrix (corrcoef) can't handle an array with high dimensionality. Since I need to apply it to images, my current implementation doesn't really help me.
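Roughly, the small-matrix version I described looks something like this (a simplified sketch rather than my exact code):

```python
import numpy as np

def naive_pca(data):
    # data: (n_samples, n_features) array
    centered = data - data.mean(axis=0)            # subtract the column means
    corr = np.corrcoef(centered, rowvar=False)     # p x p matrix -- this is where it fails for large p
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]
```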

I've read that it's possible to just take your data matrix $D$ and compute $DD^\top/n$ instead of $D^\top D/n$, but that doesn't work for me. Well, I'm not exactly sure I understand what it means, beyond the fact that it's supposed to give an $n \times n$ matrix instead of a $p\times p$ one (in my case $p\gg n$). I read about this in the eigenfaces tutorials, but none of them seemed to explain it in a way I could really follow.

In short, is there a simple algorithmic description of this method so that I can follow it?

Best Answer

The easiest way to do standard PCA is to center the columns of your data matrix (assuming the columns correspond to different variables) by subtracting the column means, and then perform an SVD. The left singular vectors, each multiplied by the corresponding singular value, give the (estimated) principal components. The right singular vectors give the (estimated) principal component directions; these are the same as the eigenvectors given by PCA. The singular values correspond to the standard deviations of the principal components, up to a factor of $\sqrt{n}$ (where $n$ is the number of rows of your data matrix); those standard deviations are the same as the square roots of the eigenvalues given by PCA.
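In NumPy, a minimal sketch of that recipe might look like this (the function name and the $1/n$ normalization are just illustrative choices):

```python
import numpy as np

def pca_svd(X):
    """PCA of an (n_samples, n_features) matrix via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                          # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    n = X.shape[0]
    scores = U * s                                   # principal components (n x k)
    directions = Vt.T                                # principal directions as columns
    stdevs = s / np.sqrt(n)                          # PC standard deviations = sqrt of covariance eigenvalues (1/n convention)
    return scores, directions, stdevs
```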

If you want to do PCA on the correlation matrix, you will need to standardize the columns of your data matrix before applying the SVD. This amounts to subtracting the means (centering) and then dividing by the standard deviations (scaling).
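The correlation-matrix version only changes the preprocessing step; a sketch:

```python
import numpy as np

def pca_svd_corr(X):
    """PCA on the correlation matrix: standardize the columns, then take the SVD."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # center, then scale by the column standard deviations
    return np.linalg.svd(Xs, full_matrices=False)
```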

This will be the most efficient approach if you want the full PCA. You can verify with some algebra that this gives you the same answer as doing the spectral decomposition of the sample covariance matrix.
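A quick numerical check of that equivalence (again just a sketch, using the $1/n$ convention):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# SVD route
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Spectral decomposition of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigvals)[::-1]

print(np.allclose(s**2 / n, eigvals[order]))                    # eigenvalues match
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, order].T)))     # directions match up to sign
```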

There are also efficient methods for computing a partial SVD, when you only need a few of the PCs. Some of these are variants of the power iteration. The Lanczos algorithm is one example that is also related to partial least squares. If your matrix is huge, you may be better off with an approximate method. There are also statistical reasons for regularizing PCA when this is the case.
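As a sketch, a truncated SVD for just the leading components could look like the following (assuming SciPy is available; scipy.sparse.linalg.svds defaults to ARPACK, an implicitly restarted Arnoldi/Lanczos-type method):

```python
import numpy as np
from scipy.sparse.linalg import svds

def partial_pca(X, k=5):
    """First k principal components via a truncated SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = svds(Xc, k=k)                 # k largest singular triplets
    order = np.argsort(s)[::-1]              # svds does not guarantee descending order
    U, s, Vt = U[:, order], s[order], Vt[order]
    scores = U * s
    return scores, Vt.T, s / np.sqrt(X.shape[0])
```

For very large matrices, randomized solvers such as scikit-learn's PCA(svd_solver='randomized') or sklearn.utils.extmath.randomized_svd are common approximate alternatives.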