Solved – Clear description of PCA using SVD of covariance matrix

dimensionality reduction, pca, svd

After reading thousands of articles on PCA and SVD, using them in a number of programming frameworks, and even implementing similar techniques (like Random Indexing), I find that I still have doubts about some parts of PCA for dimensionality reduction. So let me lay out what I know and what I am unsure about.

Let's say we have $N$ observations of $M$-dimensional data, organized as a matrix $A \in R^{N \times M}$ (one observation per row). To perform PCA we should first compute the MLE estimate of the covariance matrix $\Sigma \in R^{M \times M}$:

$$\Sigma=\frac{1}{N}\sum_{i=1}^N(x_i - \bar x)(x_i - \bar x)^T$$

where $x_i \in R^M$ is the $i$-th observation and $\bar x = \frac{1}{N}\sum_{k=1}^{N}x_k \in R^M$ is the mean observation.
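For concreteness, here is a minimal numpy sketch of this estimate (the variable names and toy data are illustrative, not from the question):

```python
import numpy as np

# Toy data: rows of A are the N observations, columns are the M variables.
rng = np.random.default_rng(0)
N, M = 100, 5
A = rng.normal(size=(N, M))

x_bar = A.mean(axis=0)          # mean observation, shape (M,)
A_c = A - x_bar                 # centered data
Sigma = (A_c.T @ A_c) / N       # MLE covariance estimate, shape (M, M)

# Same result via the explicit sum of outer products from the formula above.
Sigma_sum = sum(np.outer(x - x_bar, x - x_bar) for x in A) / N
assert np.allclose(Sigma, Sigma_sum)
```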

Then we can decompose $\Sigma$ using SVD as follows:

$$\Sigma = USV^T$$
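A small self-contained sketch of this decomposition step, assuming $\Sigma$ is an $M \times M$ covariance matrix (the example data here is hypothetical):

```python
import numpy as np

# An M x M MLE covariance matrix from some toy data (bias=True divides by N).
X = np.random.default_rng(1).normal(size=(100, 4))
Sigma = np.cov(X, rowvar=False, bias=True)

U, s, Vt = np.linalg.svd(Sigma)
print(U.shape, s.shape, Vt.shape)   # (4, 4) (4,) (4, 4)

# Sigma is symmetric positive semi-definite, so its singular values equal its
# eigenvalues, and U and V coincide (up to ambiguity for repeated or zero eigenvalues).
assert np.allclose(Sigma, U @ np.diag(s) @ Vt)
```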

Now here are several things I'm not sure about:

  1. What are dimensions of $U$, $S$ and $V^T$?
  2. In $USV^T$, what exactly counts as the eigenvalues, and which parts should I use as the principal components?
  3. How can I project original observations $x_i$ onto new reduced space and vice versa?

UPD. There is a different way to compute PCA using SVD: by factorizing the (centered) data matrix $A$ itself instead of the covariance matrix ($\Sigma = \frac{1}{N}A^TA$ when $A$ is centered). A good description of this approach may be found in this answer.
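A rough numpy sketch of this alternative route, under the assumption that rows of $A$ are observations (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 5
A = rng.normal(size=(N, M))
A_c = A - A.mean(axis=0)                              # center the data first

U, s, Vt = np.linalg.svd(A_c, full_matrices=False)    # U: N x M, s: (M,), Vt: M x M

# Relation to the covariance route: the eigenvalues of Sigma are s**2 / N,
# and the rows of Vt are the principal directions (eigenvectors of Sigma).
Sigma = (A_c.T @ A_c) / N
eigvals = np.linalg.eigvalsh(Sigma)[::-1]             # descending order
assert np.allclose(eigvals, s**2 / N)
```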

Best Answer

  • What are dimensions of $U$, $S$ and $V^T$?

Since $\Sigma$ is an $M \times M$ matrix, the three matrices $U$, $S$, and $V^T$ will all be $M \times M$. In general, applying SVD to an $N \times M$ matrix gives $U_{N{\times}N}$, $S_{N{\times}M}$, and $V^T_{M{\times}M}$; you can verify this in MATLAB. When you truncate the singular values in $S$, you should also remove the corresponding columns of $U$ and rows of $V^T$.
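A quick numpy sketch of those shapes and of the truncation just described (toy sizes, illustrative names):

```python
import numpy as np

N, M, k = 100, 5, 2
A = np.random.default_rng(0).normal(size=(N, M))

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)   # (100, 100) (5,) (5, 5); S is N x M with s on its diagonal

# Keeping only the k largest singular values means keeping the first k
# columns of U and the first k rows of V^T.
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k     # best rank-k approximation of A
```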

  • In $USV^T$ what exactly is considered as eigenvalues and which of them should I use as principal components?

PCA can be done either by an eigenvalue decomposition of the covariance matrix $\Sigma$ or by applying SVD to $A$. The left singular vectors of $\mathrm{SVD}(A)$ are the eigenvectors of $AA^T$, the right singular vectors are the eigenvectors of $A^TA$, and the squared singular values are the corresponding eigenvalues. You need to order them by eigenvalue from large to small and keep them orthonormal. $A^TA$ is called the Gram matrix and is closely related to the covariance matrix $\Sigma$: if the $M$ variables (columns of $A$) are already centered, the Gram matrix equals $N$ times the covariance matrix, $A^TA = N\Sigma$. If instead you apply SVD to $\Sigma$ itself, the diagonal entries of $S$ are its eigenvalues and the columns of $U$ (which coincide with $V$, since $\Sigma$ is symmetric positive semi-definite) are the principal directions. Check Wikipedia and some tutorials on SVD and PCA.
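A small numpy sketch comparing the two routes, assuming the columns of $A$ have been centered (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 4
A = rng.normal(size=(N, M))
A_c = A - A.mean(axis=0)                 # center the variables first

# Route 1: eigendecomposition of the covariance matrix, sorted large to small.
Sigma = (A_c.T @ A_c) / N                # Gram matrix of centered data = N * Sigma
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(A_c, full_matrices=False)

assert np.allclose(eigvals, s**2 / N)                          # same eigenvalues
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-6)   # same directions up to sign
```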

  • How can I project original observations $x_i$ onto new reduced space and vice versa?

If you apply SVD to $A$ for PCA, the projection of observation $i$ is $u_i S$ (the $i$-th row of $US$); if you apply an eigendecomposition to the covariance matrix $\Sigma$ and $V$ holds its eigenvectors, the projection is $x_i V$ (treating $x_i$ as a row vector). For dimension reduction, keep only the first $k$ columns, i.e. use $V_k$; to go back to the original space, multiply the reduced scores by $V_k^T$ (and add the mean back if the data were centered).
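A short numpy sketch of the projection and the way back, assuming centered data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 100, 5, 2
A = rng.normal(size=(N, M))
x_bar = A.mean(axis=0)
A_c = A - x_bar

U, s, Vt = np.linalg.svd(A_c, full_matrices=False)
V_k = Vt[:k, :].T                        # M x k matrix of principal directions

Z = A_c @ V_k                            # projection: scores, shape N x k
assert np.allclose(Z, U[:, :k] * s[:k])  # same as the u_i * S route from the SVD of A

A_hat = Z @ V_k.T + x_bar                # "vice versa": reconstruction in the original space
```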
