I am trying to replicate Shalizi's NY Times PCA example found in his Advanced Data Analysis with an Elementary Point of View book. I found sample code and data here
http://www.stat.cmu.edu/~cshalizi/490/pca/
I loaded the R workspace pca-examples.Rdata, and was able to extract a CSV out of it which I placed in
https://github.com/burakbayramli/classnotes/blob/master/app-math-tr/pca/nytimes.csv
In order for PCA using SVD, I did
from pandas import *
import numpy.linalg as lin
nyt = read_csv ("nytimes.csv")
labels = nyt['class.labels']
nyt = nyt.drop(["class.labels"],axis=1)
nyt_demeaned = nyt - nyt.mean(0)
u,s,v = lin.svd(nyt_demeaned.T,full_matrices=False)
Then I tried to project onto the space defined by the eigenvectors
nyt2 = np.dot(nyt_demeaned, u[:,0:2])
Then plot
plot(nyt2[:,0],nyt2[:,1],'.')
and
arts = nyt2[labels == 'art']
music = nyt2[labels == 'music']
plot(arts[:,0],arts[:,1],'r.')
plot(music[:,0],music[:,1],'b.')
I get
This image is different from Shalizi's picture from his book (using the following R code)
load("pca-examples.Rdata")
nyt.pca = prcomp(nyt.frame[,-1])
nyt.latent.sem = nyt.pca$rotation
plot(nyt.pca$x[,1:2],type="n")
points(nyt.pca$x[nyt.frame[,"class.labels"]=="art",1:2],
pch="A",col="red")
points(nyt.pca$x[nyt.frame[,"class.labels"]=="music",1:2],
pch="M",col="blue")
Maybe there was a 90 degree error, flip it one way or another, etc. then you could perhaps have the two images to be somewhat close, but even then the distribution isnt exactly right, and the data points for arts are not as clearly seperated from music data points as they are in Shalizi's picture.
Any ideas?
Best Answer
It turns out Shalizi does some extra normalization, called inverse document-frequency weighting, with this code
When I ran SVD on nyt.frame, instead of nyt.frame.raw (after load, the code above is not needed for that, it's just there for demo), the two classes were distinctly seperable. The distribution of points still did not look like Shalizi's picture, but I think I can work with this.
Note: Simply reading nytimes4.csv, the normalized file, is not enough, the data still needs to be centered around 0 (de-meaned).
Note: After skipping dropping a bogus column in nytimes4.csv and demeaning, the output is exactly the same as Shalizi's.
The same thing in Python / Pandas