Dimensionality Reduction – How to Perform PCA in R

pcar

I have a big dataset and I want to perform a dimensionality reduction.

Now everywhere I read that I can use PCA for this. However, I still don't seem to get what to do after calculating/performing the PCA. In R this is easily done with the command princomp.

But what to do after calculating the PCA? If I decided I want to use the first $100$ principal components, how do I reduce my dataset exactly?

Best Answer

I believe what you are getting at in your question concerns data truncation using a smaller number of principal components (PC). For such operations, I think the function prcompis more illustrative in that it is easier to visualize the matrix multiplication used in reconstruction.

First, give a synthetic dataset, Xt, you perform the PCA (typically you would center samples in order to describe PC's relating to a covariance matrix:

#Generate data
m=50
n=100
frac.gaps <- 0.5 # the fraction of data with NaNs
N.S.ratio <- 0.25 # the Noise to Signal ratio for adding noise to data

x <- (seq(m)*2*pi)/m
t <- (seq(n)*2*pi)/n

#True field
Xt <- 
 outer(sin(x), sin(t)) + 
 outer(sin(2.1*x), sin(2.1*t)) + 
 outer(sin(3.1*x), sin(3.1*t)) +
 outer(tanh(x), cos(t)) + 
 outer(tanh(2*x), cos(2.1*t)) + 
 outer(tanh(4*x), cos(0.1*t)) + 
 outer(tanh(2.4*x), cos(1.1*t)) + 
 tanh(outer(x, t, FUN="+")) + 
 tanh(outer(x, 2*t, FUN="+"))

Xt <- t(Xt)

#PCA
res <- prcomp(Xt, center = TRUE, scale = FALSE)
names(res)

In the results or prcomp, you can see the PC's (res$x), the eigenvalues (res$sdev) giving information on the magnitude of each PC, and the loadings (res$rotation).

res$sdev
length(res$sdev)
res$rotation
dim(res$rotation)
res$x
dim(res$x)

By squaring the eigenvalues, you get the variance explained by each PC:

plot(cumsum(res$sdev^2/sum(res$sdev^2))) #cumulative explained variance

Finally, you can create a truncated version of your data by using only the leading (important) PCs:

pc.use <- 3 # explains 93% of variance
trunc <- res$x[,1:pc.use] %*% t(res$rotation[,1:pc.use])

#and add the center (and re-scale) back to data
if(res$scale != FALSE){
	trunc <- scale(trunc, center = FALSE , scale=1/res$scale)
}
if(res$center != FALSE){
    trunc <- scale(trunc, center = -1 * res$center, scale=FALSE)
}
dim(trunc); dim(Xt)

You can see that the result is a slightly smoother data matrix, with small scale features filtered out:

RAN <- range(cbind(Xt, trunc))
BREAKS <- seq(RAN[1], RAN[2],,100)
COLS <- rainbow(length(BREAKS)-1)
par(mfcol=c(1,2), mar=c(1,1,2,1))
image(Xt, main="Original matrix", xlab="", ylab="", xaxt="n", yaxt="n", breaks=BREAKS, col=COLS)
box()
image(trunc, main="Truncated matrix (3 PCs)", xlab="", ylab="", xaxt="n", yaxt="n", breaks=BREAKS, col=COLS)
box()

enter image description here

And here is a very basic approach that you can do outside of the prcomp function:

#alternate approach
Xt.cen <- scale(Xt, center=TRUE, scale=FALSE)
C <- cov(Xt.cen, use="pair")
E <- svd(C)
A <- Xt.cen %*% E$u

#To remove units from principal components (A)
#function for the exponent of a matrix
"%^%" <- function(S, power)
     with(eigen(S), vectors %*% (values^power * t(vectors)))
Asc <- A %*% (diag(E$d) %^% -0.5) # scaled principal components

#Relationship between eigenvalues from both approaches
plot(res$sdev^2, E$d) #PCA via a covariance matrix - the eigenvalues now hold variance, not stdev
abline(0,1) # same results

Now, deciding which PCs to retain is a separate question - one that I was interested in a while back. Hope that helps.

Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of $X$ at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Will be 500,000
n <- 100    # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(numeric(), n, n)
  x.means <- rep(numeric(), n)
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(x)))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)

Machine Learning – Conducting PCA Using Princomp in MATLAB

Answers to your questions:

I usually use eig(cov(data)) to get the sample eigenvectors but it is matter of personal taste. Probably princomp is better as it gives you the eigenvectors and the projections in one step. If you know $k$ and you want exactly $k$ components it could be easier to use eigs(cov(data),k). Saves you the hassle of computing the higher order components as it returns only the $k$ largest eigenvalues and eigenvectors of the covariance matrix.
Yes, you are mostly using PCA as a dimensionality reduction step. (See next point about accuracy)
Automatic determination of PCA dimensionality is really big subject. Check Minka's work. A nice and quite easily read survey is given by Cangelosi and Goriely here. In short there is no way to have the same data fidelity with your projected data as you would with the original feature vector as by definition you will exclude some modes of variation when ($k < D$). Given that some modes might be just noise though, your classification accuracy might be better; nevertheless in terms of dataset reconstruction you just need to decide of an amount (ie. percentage) of variation you want to retain, you can easily calculate that by the eigenvalue ratios $\frac{\lambda_i}{\Sigma_{i=1}^{k} \lambda_i}$.
I haven't done that before. In general your big set of samples is not a problem; the dimensionality of them is. You can calculate the eigenvectors of 100 million 10-dimensional sample in just seconds. The covariance of a 10 element sample of 10^8 readings/dimensionality each will take you a lot longer. There is work on Online/sequential estimation of PCA according to wikipedia though. Warmuth and Kuzmin's "Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension" seems to investigate just that. (Probably I should also read this paper now that I think of it)

For your final GPU-related question: While I don't work with GPUs extensively myself (and even more I have never used MATLAB for that), I know if you have multiple I/O procedure to the GPU that is a severe computational bottleneck. As such the speed-up you get by the massively parellel computational capabilities of a GPU is nullified by the I/O overhead if you have a small dataset. You'll almost surely get more insightful answers about that point in StackOverflow.

Best Answer

Related Solutions

Dimensionality Reduction – Applying SVD or PCA on Large Sparse Matrices

Example

Machine Learning – Conducting PCA Using Princomp in MATLAB

Related Question