Data Visualization with PCA – Visualizing a Million Data Points

Tags: biplot, data-visualization, pca, r

Is it possible to visualize the output of Principal Component Analysis in ways that give more insight than just summary tables? Is it possible to do it when the number of observations is large, say ~1e4? And is it possible to do it in R [other environments welcome]?

Best Answer

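The biplot is a useful tool for visualizing the results of PCA. It shows the principal component scores (one point per observation) and the variable directions (one arrow per variable) in the same plot. With 10,000 observations you’ll probably run into over-plotting, where points pile on top of each other and hide the density of the cloud. Alpha blending, that is, drawing points semi-transparently so that dense regions appear darker, could help there.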
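As a rough illustration of alpha blending on a score plot of this size, here is a minimal sketch in base R. The data are simulated (10,000 rows, 4 variables, seed chosen arbitrarily) and the alpha value of 0.1 is just a starting point; none of this comes from the original answer's code.

```r
# Simulated stand-in for a large data set (assumption: 10,000 rows, 4 variables).
set.seed(1)
n <- 10000
x <- matrix(rnorm(n * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.5 * x[, 2]   # induce some correlation between V1 and V2
colnames(x) <- paste0("V", 1:4)

# PCA on centered, standardized variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]            # PC1 and PC2 scores, one row per observation

# Semi-transparent blue: overlapping points accumulate into darker regions.
point_col <- rgb(0, 0, 1, alpha = 0.1)

plot(scores, pch = 16, cex = 0.6, col = point_col,
     xlab = "PC1", ylab = "PC2", asp = 1)
```

Lower alpha values suit larger data sets; with far more points, a binned or smoothed density display may work better than individual points.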

Here is a PC biplot of the wine data from the UCI ML repository:

[Figure: PC biplot of wine data from the UCI ML repository]

The points correspond to the PC1 and PC2 scores of each observation. The arrows represent the correlation of the variables with PC1 and PC2. The white circle indicates the theoretical maximum extent of the arrows. The ellipses are 68% data ellipses for each of the 3 wine varieties in the data.
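The elements described above can be sketched in base R as follows. This is not the answer's own code: it uses the built-in `iris` data (three species) as a stand-in for the wine data, standardizes the scores so they share a scale with the correlation arrows (one common convention, assumed here), and defines a hypothetical helper `data_ellipse()` that draws a 68% ellipse under a bivariate normal approximation.

```r
# Stand-in data: iris (3 species), NOT the wine data used in the answer.
x   <- iris[, 1:4]
grp <- iris$Species

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Standardized PC1/PC2 scores (unit variance), one point per observation.
z <- sweep(pca$x[, 1:2], 2, pca$sdev[1:2], "/")

# Correlation of each original variable with PC1 and PC2: the arrow tips.
# Each arrow has length <= 1, so the unit circle is its maximum extent.
arrows_xy <- cor(x, pca$x[, 1:2])

# Hypothetical helper: points on an ellipse covering `prob` of a bivariate
# normal distribution with the sample mean and covariance of `m`.
data_ellipse <- function(m, prob = 0.68, n = 100) {
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  r      <- sqrt(qchisq(prob, df = 2))
  sweep(r * circle %*% chol(cov(m)), 2, colMeans(m), "+")
}

plot(z, col = as.integer(grp), pch = 16, asp = 1,
     xlab = "PC1 (standardized)", ylab = "PC2 (standardized)")

# Unit circle bounding the correlation arrows.
theta <- seq(0, 2 * pi, length.out = 200)
lines(cos(theta), sin(theta), col = "grey50")

# One arrow per variable, labelled just beyond the tip.
arrows(0, 0, arrows_xy[, 1], arrows_xy[, 2], length = 0.08)
text(arrows_xy * 1.1, labels = rownames(arrows_xy), cex = 0.8)

# One 68% data ellipse per group.
for (g in levels(grp)) {
  lines(data_ellipse(z[grp == g, ]), col = which(levels(grp) == g))
}
```

An arrow that nearly reaches the circle marks a variable that PC1 and PC2 together represent well; a short arrow marks one whose variation lies mostly in the later components.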

I have made the code for generating this plot available here.
