Solved – PCA loadings interpretation

pcarregressionstatistical significance

Regarding the final output of PCA:

Using rotation, I'll get a loading for my 50 components, so the equation would be $pc_1=0.05*a_1+0.02*a_2+\dots+0.04*a_{50}$. Now, can I derive the value of $pc_1$ by using my last observed return for $a_1,a_2,\dots,a_{50}$?
As I looked at my data, I have found at least 10 principal components are required to explain 80% of the variance, so using same logic as above, I'll get $pc_1,pc_2,\dots,pc_{10}$. Now the regression would be $y=w_1*pc_1+w_2*pc_2+…+w_{10}*pc_{10}+e $. In each of them $w_1,w_2,\dots,w_{10}$ can be their respective eigenvalue and the remaining term is an error term?

Best Answer

Everything sounds pretty OK except the ending of Q2. When you do regress y on the first ten PCs, w1,...,w10 are unknown weights to be estimated from your data using Ordinary Least Squares for example.

Related Solutions

Solved – Using PCA for detecting similar regions in an image

It's to be expected that "copied" blocks are almost equal (and more so after the PCA manipulation), so in the lexicographical sort (warning: it's understood that this lexicographic order orders first the most principal component, and so on) "copied" blocks should appear adjacent or near (the reverse is not true: adjacent lexicographicly sorted elements are not necessarily copied, nor even similar)

Here I made up a very simple example myself, in Octave, with a unidimensional signal (y) of size N=200, which has a portion of it copied (here, from 20-50 to 150-180) and a little noise added. I take a small block size (b=3). I convert to PC, sort the rows in lexicographical order (I append first the original block position in an extra column), and compute the distance between adjacent rows (notice that I'm simplifiying a lot here: I'm not discarding components, nor quantizing them; and I'm considering only adjacent rows, not a neighborhood band). I then look at the histogram of those distance, and the original offset is cleary visible.

N=200;
b=3;
delay=130; 
y = filter([1],[1,-0.8,0.1],rand(1,N)-0.5); % my signal, rather arbitrary
y(20+delay:50+delay) = y(20:50);  % a portion is copied
y += (rand(1,N)-0.5)*0.1; % noise added
yy=[y(1:N-2);y(2:N-1);y(3:N)];  % octave does not have  corrmtx (this is not general in b!)
[PC, Z, W, TSQ] = princomp (yy'); % PCA
Z(:,b+1)=[1:N-2]'; % append original block position, in extra row
Z1=sortrows(Z);  % sort rows lexicographycally
Z2=abs(Z1(1:N-3,b+1)-Z1(2:N-2,b+1));  % compute temporal distances between adjacent rows
histo(Z2); % histogram: should show a peak at delay

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.

So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X X'$, which in this case is a $10000$ by $10000$ matrix: that may be manageable. Its computation involves about 50 million calculations of dot products of one column with the next.

Consider any two columns, then, $Y$ and $Z$ (each one of them is a $500000$-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of $1$'s,

$$(Y - m_Y\mathbf{1}) \cdot (Z - m_Z\mathbf{1}) = Y\cdot Z - m_Z\mathbf{1}\cdot Y - m_Y\mathbf{1}.Z + m_Z m_Y \mathbf{1}\cdot \mathbf{1}\\ = Y\cdot Z -n (m_Ym_Z),$$

because $m_Y = \mathbf{1}\cdot Y / n$ and $m_Z = \mathbf{1}\cdot Z/n$.

This allows you to use sparse matrix techniques to compute $X X'$, whose entries provide the values of $Y\cdot Z$, and then adjust its coefficients based on the $10000$ column means. The adjustment shouldn't hurt, because it seems unlikely $X X'$ will be very sparse.

Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Will be 500,000
n <- 100    # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(numeric(), n, n)
  x.means <- rep(numeric(), n)
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)

Best Answer

Related Solutions

Solved – Using PCA for detecting similar regions in an image

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

Example

Related Question