Solved – Good PCA examples for teaching

dataset, pca, teaching

I'm teaching linear algebra to a class of engineers, social scientists and computer programmers. We just did singular value decomposition, and we have an extra day, so I thought I'd talk about the relation between singular value decomposition and principal component analysis. I have the theory part of the lecture written just fine, but I am having a hard time finding good examples to use. Here are the constraints:

  • I want to show pictures. Ideally, the graphics should work well by themselves: Axes and data points in the scatter-plot should be labelled. English words are better than Latin species names.

  • The question being studied should be interesting. Morphology of Nigerian fish, while important, is not a good way to grab a class's attention.

  • In contrast to the preceding bullet point: Nothing on human racial differences; nothing on intelligence testing. That would lead to a lively discussion which would have nothing to do with the mathematical techniques.

  • The mathematical analysis method should be basically pure PCA. The DW-NOMINATE project, while awesome, uses PCA only as a starting point, which is followed by a much more complicated hill-climbing algorithm.

I would think this would be easy. I can easily think of a dozen fun analysis projects I could do if I had the time to gather the data: Take the Pew Research polls and see whether PCA recovers the social policy/fiscal policy axis beloved of libertarians. Take a dozen measurements of typical dog breed physical characteristics and see if PCA can find the "sheep dog" cluster. Etcetera, etcetera… I'm looking for someone else who has already done the work so I can show it off.

Best Answer

There are some step-by-step guides in Shalizi's notes here: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf, one using the cars data set from R and another using art and music articles from the New York Times. (Inferring the topic of an article from the words contained in it is a very active research area.) If you don't know or don't want to learn R, you could still use his notes and graphics.

Edit: forgot to say that there are also several good examples in a book by Everitt and Hothorn, which is available on SpringerLink. As I recall, one data set is jet fighters and there is also Roman pottery.
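
If you would rather demonstrate the SVD/PCA relation from the question directly in class without R, here is a minimal sketch in Python with NumPy. It is not taken from Shalizi's notes or from Everitt and Hothorn; the data are synthetic, and the function name `pca_via_svd` and all variable names are my own for illustration. It centers the data, takes the SVD, and checks that the squared singular values divided by n - 1 match the eigenvalues of the sample covariance matrix.

```python
import numpy as np


def pca_via_svd(X):
    """PCA of X (rows = observations, columns = variables) using the SVD.

    Returns (scores, components, explained_variance), where the rows of
    `components` are the principal directions.
    """
    Xc = X - X.mean(axis=0)                      # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # same as Xc @ Vt.T
    explained_variance = s**2 / (X.shape[0] - 1)
    return scores, Vt, explained_variance


# Synthetic data (an assumption, purely for illustration): three variables
# driven by one latent factor plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

scores, components, explained_variance = pca_via_svd(X)

# Cross-check against the eigenvalues of the sample covariance matrix,
# sorted from largest to smallest.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print("variance share per component:",
      explained_variance / explained_variance.sum())
print("matches covariance eigenvalues:",
      np.allclose(explained_variance, cov_eigenvalues))
```

Plotting the first two columns of `scores` with labelled points gives the kind of scatter plot the question asks for, once the synthetic matrix is swapped for a real data set.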