PCA in Python – How to Perform PCA on Data with High Dimensionality

pca python

To perform principal component analysis (PCA), you subtract the mean of each column from the data, compute the correlation coefficient matrix, and then find its eigenvectors and eigenvalues. At least, that is how I implemented it in Python, but it only works for small matrices because the function I use to build the correlation coefficient matrix (corrcoef) can't handle an array with high dimensionality. Since I need to apply it to images, my current implementation doesn't really help me.
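Roughly, the small-matrix version I described looks something like this (a simplified sketch rather than my exact code):

```python
import numpy as np

def naive_pca(data):
    # data: (n_samples, n_features) array
    centered = data - data.mean(axis=0)            # subtract the column means
    corr = np.corrcoef(centered, rowvar=False)     # p x p matrix -- this is where it fails for large p
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]
```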

I've read that it's possible to just take your data matrix $D$ and compute $DD^\top/n$ instead of $D^\top D/n$, but that doesn't work for me. Well, I'm not exactly sure I understand what it means, beyond the fact that it's supposed to give an $n \times n$ matrix instead of a $p\times p$ one (in my case $p\gg n$). I read about this in the eigenfaces tutorials, but none of them seemed to explain it in a way I could really follow.

In short, is there a simple algorithmic description of this method so that I can follow it?

Best Answer

The easiest way to do standard PCA is to center the columns of your data matrix (assuming the columns correspond to different variables) by subtracting the column means, and then perform an SVD. The left singular vectors, each multiplied by the corresponding singular value, give the (estimated) principal components. The right singular vectors give the (estimated) principal component directions; these are the same as the eigenvectors given by PCA. The singular values correspond to the standard deviations of the principal components, up to a factor of $\sqrt{n}$ (where $n$ is the number of rows of your data matrix); those standard deviations are the same as the square roots of the eigenvalues given by PCA.
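In NumPy, a minimal sketch of that recipe might look like this (the function name and the $1/n$ normalization are just illustrative choices):

```python
import numpy as np

def pca_svd(X):
    """PCA of an (n_samples, n_features) matrix via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                          # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    n = X.shape[0]
    scores = U * s                                   # principal components (n x k)
    directions = Vt.T                                # principal directions as columns
    stdevs = s / np.sqrt(n)                          # PC standard deviations = sqrt of covariance eigenvalues (1/n convention)
    return scores, directions, stdevs
```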

If you want to do PCA on the correlation matrix, you will need to standardize the columns of your data matrix before applying the SVD. This amounts to subtracting the means (centering) and then dividing by the standard deviations (scaling).
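The correlation-matrix version only changes the preprocessing step; a sketch:

```python
import numpy as np

def pca_svd_corr(X):
    """PCA on the correlation matrix: standardize the columns, then take the SVD."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # center, then scale by the column standard deviations
    return np.linalg.svd(Xs, full_matrices=False)
```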

This will be the most efficient approach if you want the full PCA. You can verify with some algebra that this gives you the same answer as doing the spectral decomposition of the sample covariance matrix.
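A quick numerical check of that equivalence (again just a sketch, using the $1/n$ convention):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# SVD route
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Spectral decomposition of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigvals)[::-1]

print(np.allclose(s**2 / n, eigvals[order]))                    # eigenvalues match
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, order].T)))     # directions match up to sign
```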

There are also efficient methods for computing a partial SVD, when you only need a few of the PCs. Some of these are variants of the power iteration. The Lanczos algorithm is one example that is also related to partial least squares. If your matrix is huge, you may be better off with an approximate method. There are also statistical reasons for regularizing PCA when this is the case.
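As a sketch, a truncated SVD for just the leading components could look like the following (assuming SciPy is available; scipy.sparse.linalg.svds defaults to ARPACK, an implicitly restarted Arnoldi/Lanczos-type method):

```python
import numpy as np
from scipy.sparse.linalg import svds

def partial_pca(X, k=5):
    """First k principal components via a truncated SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = svds(Xc, k=k)                 # k largest singular triplets
    order = np.argsort(s)[::-1]              # svds does not guarantee descending order
    U, s, Vt = U[:, order], s[order], Vt[order]
    scores = U * s
    return scores, Vt.T, s / np.sqrt(X.shape[0])
```

For very large matrices, randomized solvers such as scikit-learn's PCA(svd_solver='randomized') or sklearn.utils.extmath.randomized_svd are common approximate alternatives.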