Dimensionality Reduction – How to Perform PCA in R

Tags: pca, r

I have a big dataset and I want to perform a dimensionality reduction.

Everywhere I read, it says that I can use PCA for this. However, I still don't quite understand what to do after calculating/performing the PCA. In R this is easily done with the command princomp.

But what do I do after calculating the PCA? If I decide I want to use the first $100$ principal components, how exactly do I reduce my dataset?

Best Answer

I believe what you are getting at in your question concerns data truncation using a smaller number of principal components (PCs). For such operations, I think the function prcomp is more illustrative, because it makes the matrix multiplication used in the reconstruction easier to visualize.
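In matrix terms, if $Z$ holds the PC scores (prcomp's x) and $V$ the loadings (prcomp's rotation), then keeping only the first $k$ PCs reconstructs a rank-$k$ approximation of the centered (and possibly scaled) data, $\hat{X}_k = Z_{1:k} V_{1:k}^\top$, after which the centering (and scaling) is undone. This is exactly the multiplication carried out below.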

First, given a synthetic dataset, Xt, you perform the PCA (typically you would center the samples so that the PCs describe a covariance matrix):

#Generate data
m=50
n=100
frac.gaps <- 0.5 # the fraction of data with NaNs (defined here but not used in this example)
N.S.ratio <- 0.25 # the noise-to-signal ratio for adding noise to data (also not used below)

x <- (seq(m)*2*pi)/m
t <- (seq(n)*2*pi)/n

#True field
Xt <- 
 outer(sin(x), sin(t)) + 
 outer(sin(2.1*x), sin(2.1*t)) + 
 outer(sin(3.1*x), sin(3.1*t)) +
 outer(tanh(x), cos(t)) + 
 outer(tanh(2*x), cos(2.1*t)) + 
 outer(tanh(4*x), cos(0.1*t)) + 
 outer(tanh(2.4*x), cos(1.1*t)) + 
 tanh(outer(x, t, FUN="+")) + 
 tanh(outer(x, 2*t, FUN="+"))

Xt <- t(Xt)

#PCA
res <- prcomp(Xt, center = TRUE, scale = FALSE)
names(res)

In the results of prcomp, you can see the PC scores (res$x), the standard deviations of the PCs (res$sdev), which convey the magnitude of each PC, and the loadings (res$rotation).

res$sdev           # standard deviations of the PCs (square roots of the eigenvalues)
length(res$sdev)   # one value per PC
res$rotation       # loadings (eigenvectors)
dim(res$rotation)  # variables x PCs
res$x              # PC scores
dim(res$x)         # samples x PCs

By squaring these standard deviations you get the eigenvalues, i.e. the variance explained by each PC:

plot(cumsum(res$sdev^2/sum(res$sdev^2))) #cumulative explained variance
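If you would rather pick the number of PCs programmatically than read it off the plot, here is a minimal sketch (the 90% threshold is just an illustrative choice, not a rule):

expl.var <- cumsum(res$sdev^2) / sum(res$sdev^2)
which(expl.var >= 0.90)[1] # smallest number of PCs explaining at least 90% of variance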

Finally, you can create a truncated version of your data by using only the leading (important) PCs:

pc.use <- 3 # explains 93% of variance
trunc <- res$x[,1:pc.use] %*% t(res$rotation[,1:pc.use])

#add the scale and center back to the data (re-scale first, then re-center)
if(res$scale != FALSE){
  trunc <- scale(trunc, center = FALSE, scale = 1/res$scale)
}
if(res$center != FALSE){
  trunc <- scale(trunc, center = -1 * res$center, scale = FALSE)
}
dim(trunc); dim(Xt)
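As a quick sanity check before plotting, you can quantify what the truncation discards; a small sketch comparing the root-mean-square error of the 3-PC reconstruction against a mean-only baseline:

#reconstruction error
sqrt(mean((Xt - trunc)^2)) # RMSE of the 3-PC reconstruction
sqrt(mean(scale(Xt, scale = FALSE)^2)) # baseline: RMSE around the column means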

You can see that the result is a slightly smoother data matrix, with small-scale features filtered out:

RAN <- range(cbind(Xt, trunc))
BREAKS <- seq(RAN[1], RAN[2], length.out=100)
COLS <- rainbow(length(BREAKS)-1)
par(mfcol=c(1,2), mar=c(1,1,2,1))
image(Xt, main="Original matrix", xlab="", ylab="", xaxt="n", yaxt="n", breaks=BREAKS, col=COLS)
box()
image(trunc, main="Truncated matrix (3 PCs)", xlab="", ylab="", xaxt="n", yaxt="n", breaks=BREAKS, col=COLS)
box()

[Figure: image plots of the original matrix (left) and the truncated matrix using 3 PCs (right)]

And here is a very basic version of the same approach that you can carry out by hand, outside of the prcomp function:

#alternate approach
Xt.cen <- scale(Xt, center=TRUE, scale=FALSE) # center the columns
C <- cov(Xt.cen, use="pair") # covariance matrix
E <- svd(C) # for a symmetric matrix, svd() returns its eigendecomposition
A <- Xt.cen %*% E$u # principal component scores

#To remove units from principal components (A)
#function for the exponent of a matrix
"%^%" <- function(S, power)
     with(eigen(S), vectors %*% (values^power * t(vectors)))
Asc <- A %*% (diag(E$d) %^% -0.5) # scaled principal components

#Relationship between eigenvalues from both approaches
plot(res$sdev^2, E$d) #PCA via a covariance matrix - the eigenvalues now hold variance, not stdev
abline(0,1) # same results
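To convince yourself that the two routes agree beyond the eigenvalues, you can also compare the leading scores directly; they should match up to the sign of each PC, which is arbitrary (a quick check, reusing pc.use from above):

max(abs(abs(A[,1:pc.use]) - abs(res$x[,1:pc.use]))) # ~0: leading scores agree up to sign flips
round(apply(Asc[,1:pc.use], 2, var), 2) # the scaled PCs each have ~unit variance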

Now, deciding how many PCs to retain is a separate question, one that I was interested in a while back. Hope that helps.