# R – Why Log-Transform Data Before Principal Component Analysis

Tags: data-transformation, logarithm, pca, r

I'm following a tutorial here (http://www.r-bloggers.com/computing-and-visualizing-pca-in-r/) to gain a better understanding of PCA.

The tutorial uses the Iris dataset and applies a log transform prior to PCA:

Notice that in the following code we apply a log transformation to the continuous variables as suggested by [1] and set center and scale equal to TRUE in the call to prcomp to standardize the variables prior to the application of PCA.

Could somebody explain to me in plain English why the log function is applied to the first four columns of the Iris dataset before PCA? I understand it has something to do with making the data relative, but I'm confused about what exactly log, center and scale each do.

The reference [1] above is to Venables and Ripley, Modern Applied Statistics with S-PLUS, Section 11.1, which briefly says:

The data are physical measurements, so a sound initial strategy is to work on log scale. This has been done throughout.

The iris data set is a fine example for learning PCA. That said, the first four columns, describing the lengths and widths of sepals and petals, are not strongly skewed. Log-transforming them therefore changes the results very little: the rotation of the principal components is nearly the same either way.
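This is easy to verify directly. The sketch below (my own check, not part of the original answer) runs the standardized PCA with and without a prior log transform and compares the PC1 directions; since the sign of a principal component is arbitrary, the absolute inner product of the two loading vectors is the natural measure of agreement, and it comes out very close to 1:

```r
data(iris)

# PCA on the standardized iris measurements, with and without a log transform
raw    <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
logged <- prcomp(log(iris[, -5]), center = TRUE, scale. = TRUE)

# Both PC1 loading vectors have unit length, so their absolute inner
# product is 1 when the directions coincide exactly
abs(sum(raw$rotation[, 1] * logged$rotation[, 1]))
```

A value near 1 confirms that, for this particular data set, the log transform leaves the first principal component essentially unchanged.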

In other situations log-transformation is a good choice.

We perform PCA to get insight into the general structure of a data set. We center, scale and sometimes log-transform to filter out trivial effects that could otherwise dominate the PCA. The PCA algorithm finds, for each component in turn, the rotation that minimizes the squared residuals, i.e. the sum of squared perpendicular distances from the samples to that component. Large values therefore tend to have high leverage.
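To illustrate the leverage effect of scale alone (a toy example of my own, not from the iris data): take two strongly correlated variables where one happens to be recorded in millimetres and the other in metres. Without scaling, the large-valued column dominates PC1 almost entirely; with center = TRUE and scale. = TRUE, both variables contribute on an equal footing.

```r
set.seed(42)
x <- rnorm(200)                      # a quantity measured in metres
y <- x + rnorm(200, sd = 0.1)        # a closely related quantity, also in metres
d <- data.frame(x_mm = 1000 * x,     # same quantity, re-expressed in millimetres
                y_m  = y)

# Without scaling, the millimetre column has ~10^6 times the variance,
# so PC1 points almost exactly along x_mm
unscaled <- prcomp(d, center = TRUE, scale. = FALSE)
summary(unscaled)$importance[2, 1]      # proportion of variance on PC1, ~1
abs(unscaled$rotation["x_mm", "PC1"])   # x_mm loading on PC1, ~1

# With scaling, both variables share PC1 roughly equally
scaled <- prcomp(d, center = TRUE, scale. = TRUE)
round(abs(scaled$rotation[, "PC1"]), 2) # both loadings near 0.71
```

The unit change carries no information about the structure of the data, yet it completely determines the unscaled PCA; that is exactly the kind of trivial effect that centering and scaling filter out.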

Imagine injecting two new samples into the iris data: a flower with a petal length of 430 cm and one with a petal length of 0.0043 cm. Both flowers are highly abnormal, being roughly 100 times larger and 1000 times smaller, respectively, than the average flower. The leverage of the first flower is huge, so the first PCs will mostly describe the differences between the large flower and every other flower, and clustering of the species becomes impossible because of that one outlier. If the data are log-transformed, distances on the transformed scale describe relative rather than absolute variation, and the small flower becomes the most abnormal one. Even so, it is now possible both to fit all samples into one image and to obtain a fair clustering of the species. Check out this example:

data(iris)  # load the data

# Add two new observations from two hypothetical new species
levels(iris[, 5]) <- c(levels(iris[, 5]), "setosa_gigantica", "virginica_brevis")
iris[151, ] <- list(6, 3, 430, 1.5, "setosa_gigantica")    # a big flower
iris[152, ] <- list(6, 3, 0.0043, 1.5, "virginica_brevis") # a small flower

# Plot scores of PC1 and PC2 without log transformation
plot(prcomp(iris[, -5], center = TRUE, scale. = TRUE)$x[, 1:2], col = iris$Species)

# Plot scores of PC1 and PC2 with log transformation
plot(prcomp(log(iris[, -5]), center = TRUE, scale. = TRUE)$x[, 1:2], col = iris$Species)