I'm following a tutorial here: http://www.r-bloggers.com/computing-and-visualizing-pca-in-r/ to gain a better understanding of PCA.
The tutorial uses the Iris dataset and applies a log transform prior to PCA:
Notice that in the following code we apply a log transformation to the continuous variables as suggested by [1] and set center and scale equal to TRUE in the call to prcomp to standardize the variables prior to the application of PCA.
Could somebody explain to me in plain English why you first apply the log function to the first four columns of the Iris dataset? I understand it has something to do with making the data relative, but I am confused about what exactly log, center and scale do.
The reference [1] above is to Venables and Ripley, Modern Applied Statistics with S-PLUS, Section 11.1, which briefly says:
The data are physical measurements, so a sound initial strategy is to work on log scale. This has been done throughout.
Best Answer
The iris data set is a fine example for learning PCA. That said, the first four columns, which describe the length and width of sepals and petals, are not an example of strongly skewed data. Log-transforming the data therefore does not change the results much, since the resulting rotation of the principal components is largely unchanged by the transformation.
In other situations, for example with strongly skewed data or values spanning several orders of magnitude, log-transformation is a good choice.
We perform PCA to gain insight into the general structure of a data set. We center, scale and sometimes log-transform to filter out trivial effects that could otherwise dominate the PCA. The PCA algorithm then finds the rotation of each PC that minimizes the squared residuals, namely the sum of squared perpendicular distances from each sample to the PCs. Because the distances are squared, large values tend to have high leverage.
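As a rough illustration of what the preprocessing does, here is a minimal sketch in Python using only the standard library. It is not the tutorial's R code; the two columns and their values are made up for illustration, and `standardize` is a hypothetical helper that mimics the effect of centering and scaling. It shows that a column recorded in larger units carries almost all of the raw variance, and that after log-transforming, centering and scaling, every column has mean 0 and variance 1.

```python
import math
import statistics

def standardize(column):
    """Center a column to mean 0 and scale it to unit sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in column]

# Hypothetical measurements (not the iris data): two variables in different units.
length_cm = [5.1, 4.9, 6.3, 5.8, 7.1]
width_mm = [35.0, 30.0, 33.0, 27.0, 30.0]

# Share of the total raw variance carried by each column.
# The mm column carries over 90% of it, purely because of its units.
raw_var = [statistics.variance(length_cm), statistics.variance(width_mm)]
raw_share = [v / sum(raw_var) for v in raw_var]

# Log-transform, then center and scale each column.
logged = [[math.log(x) for x in col] for col in (length_cm, width_mm)]
scaled = [standardize(col) for col in logged]

scaled_mean = [statistics.mean(col) for col in scaled]
scaled_var = [statistics.variance(col) for col in scaled]

print("raw variance share:", [round(s, 3) for s in raw_share])
print("variance after preprocessing:", [round(v, 3) for v in scaled_var])
```

After this step no variable can dominate the components merely because of the units it was measured in, which is the "trivial effect" that centering and scaling are meant to remove.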
Imagine injecting two new samples into the iris data: a flower with a petal length of 430 cm and one with a petal length of 0.0043 cm. Both flowers are very abnormal, being roughly 100 times larger and 1000 times smaller, respectively, than an average flower. The leverage of the first flower is huge, so the first PCs will mostly describe the difference between the large flower and every other flower. Clustering of species is not possible because of that one outlier. If the data are log-transformed, absolute differences now describe relative variation. Now the small flower is the most abnormal one. Nonetheless, it is possible both to contain all samples in one image and to provide a fair clustering of the species. Check out this example:
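To put rough numbers on the leverage argument, the sketch below is a one-variable stand-in written in Python with only the standard library. It is not a full PCA and not the original answer's code. The six "typical" petal lengths are hypothetical values chosen to lie in the range of ordinary iris flowers, and `squared_deviation_shares` is a helper name introduced here. Each sample's share of the total squared deviation from the mean is used as a simple proxy for how strongly it pulls on a least-squares fit.

```python
import math

def squared_deviation_shares(values):
    """Return each value's share of the total squared deviation from the mean."""
    mean = sum(values) / len(values)
    squared = [(v - mean) ** 2 for v in values]
    total = sum(squared)
    return [s / total for s in squared]

# Hypothetical petal lengths in cm, roughly in the range of ordinary iris flowers.
typical = [1.4, 1.5, 4.5, 4.7, 5.1, 5.9]
# The two abnormal flowers described in the text.
petal_length = typical + [430.0, 0.0043]
labels = ["typical"] * len(typical) + ["large (430 cm)", "small (0.0043 cm)"]

raw_shares = squared_deviation_shares(petal_length)
log_shares = squared_deviation_shares([math.log(v) for v in petal_length])

# Which sample contributes most to the squared deviations on each scale?
most_extreme_raw = labels[raw_shares.index(max(raw_shares))]
most_extreme_log = labels[log_shares.index(max(log_shares))]

print("raw scale:", most_extreme_raw, round(max(raw_shares), 2))
print("log scale:", most_extreme_log, round(max(log_shares), 2))
```

On the raw scale the large flower accounts for most of the squared deviation, while on the log scale the small flower becomes the most extreme sample and neither outlier swamps the total to the same degree.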