Solved – Replicating Shalizi’s New York Times PCA example

dimensionality-reduction, pca, python, svd

I am trying to replicate Shalizi's NY Times PCA example from his book Advanced Data Analysis from an Elementary Point of View. I found sample code and data here:

http://www.stat.cmu.edu/~cshalizi/490/pca/

I loaded the R workspace pca-examples.Rdata and was able to extract a CSV from it, which I placed at

https://github.com/burakbayramli/classnotes/blob/master/app-math-tr/pca/nytimes.csv

To perform PCA via SVD, I did

from pandas import *
import numpy as np
import numpy.linalg as lin

nyt = read_csv("nytimes.csv")
labels = nyt['class.labels']              # keep the labels for plotting later
nyt = nyt.drop(["class.labels"], axis=1)  # word counts only
nyt_demeaned = nyt - nyt.mean(0)          # center each column at zero
u, s, v = lin.svd(nyt_demeaned.T, full_matrices=False)

Then I projected onto the space spanned by the first two left singular vectors (the principal directions)

nyt2 = np.dot(nyt_demeaned, u[:,0:2])
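As a sanity check on this projection step (not part of the original post): with the SVD taken on the transposed, de-meaned matrix, projecting onto the first two columns of u is equivalent to taking the first two right singular vectors scaled by their singular values. A minimal sketch with synthetic data:

```python
import numpy as np
import numpy.linalg as lin

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))        # 50 "documents", 10 "words" (synthetic)
X_dm = X - X.mean(axis=0)            # de-mean the columns, as in the post

# SVD of the transposed matrix, matching the post's convention
u, s, v = lin.svd(X_dm.T, full_matrices=False)

scores = X_dm @ u[:, :2]             # the projection used in the post
alt = v[:2].T * s[:2]                # scaled right singular vectors

assert np.allclose(scores, alt)      # the two score matrices agree
```

So the scores plotted below are exactly the PC scores that `prcomp` would report (up to sign).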

Then plot

plot(nyt2[:,0],nyt2[:,1],'.')

and

arts = nyt2[labels == 'art']
music = nyt2[labels == 'music']
plot(arts[:,0],arts[:,1],'r.')
plot(music[:,0],music[:,1],'b.')

I get

[scatter plot of the projected points]

[scatter plot with art points in red and music points in blue]

This differs from Shalizi's figure in the book, which is produced with the following R code:

load("pca-examples.Rdata")
nyt.pca = prcomp(nyt.frame[,-1])
nyt.latent.sem = nyt.pca$rotation
plot(nyt.pca$x[,1:2],type="n")
points(nyt.pca$x[nyt.frame[,"class.labels"]=="art",1:2],
       pch="A",col="red")
points(nyt.pca$x[nyt.frame[,"class.labels"]=="music",1:2],
       pch="M",col="blue")

[Shalizi's figure: first two principal components, with "A" (art) points in red and "M" (music) points in blue]

Maybe there is a 90-degree rotation or an axis flip that would bring the two images somewhat closer, but even then the distribution isn't quite right, and the art points are not as clearly separated from the music points as they are in Shalizi's picture.
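One thing worth keeping in mind here: singular vectors are only determined up to sign, so different SVD implementations can legitimately return mirrored axes. A minimal sketch (synthetic data, not the NYT matrix) showing that flipping the sign of a left/right singular vector pair leaves the decomposition intact:

```python
import numpy as np
import numpy.linalg as lin

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))
u, s, v = lin.svd(X, full_matrices=False)

# Flip the sign of the first left AND right singular vectors together:
# the reconstruction is unchanged, but scores along axis 0 are mirrored.
u2, v2 = u.copy(), v.copy()
u2[:, 0] *= -1
v2[0, :] *= -1

assert np.allclose(u @ np.diag(s) @ v, u2 @ np.diag(s) @ v2)
```

So a mirror flip alone would not explain a genuinely different point distribution, only a reflected one.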

Any ideas?

Best Answer

It turns out Shalizi applies an extra normalization, inverse document-frequency (IDF) weighting followed by row normalization, with this code

scale.cols <- function(x,s) {
  # multiply column j of x by s[j]
  return(t(apply(x,1,function(x){x*s})))
}

scale.rows <- function(x,s) {
  # multiply row i of x by s[i]; companion helper from the same source,
  # needed by div.by.euc.length below
  return(apply(x,2,function(x){x*s}))
}

div.by.euc.length <- function(x) {
  # normalize each row to unit Euclidean length
  scale.rows(x,1/sqrt(rowSums(x^2)+1e-16))
}

idf.weight <- function(x) {
  # IDF weighting: weight each word by log(n.docs / doc.freq)
  doc.freq <- colSums(x>0)
  doc.freq[doc.freq == 0] <- 1
  w <- log(nrow(x)/doc.freq)
  return(scale.cols(x,w))
}

load("pca-examples.Rdata")
nyt.frame.raw$class.labels <- NULL
nyt.frame <- idf.weight(nyt.frame.raw)
nyt.frame <- div.by.euc.length(nyt.frame)

When I ran SVD on nyt.frame instead of nyt.frame.raw (after load; the code above is not needed for that, it is just there as a demo), the two classes were distinctly separable. The distribution of points still did not look exactly like Shalizi's picture, but I think I can work with this.

Note: simply reading nytimes4.csv, the normalized file, is not enough; the data still needs to be centered around 0 (de-meaned).

Note: after dropping a bogus column in nytimes4.csv and de-meaning, the output is exactly the same as Shalizi's.

The same thing in Python / Pandas

from pandas import *
import numpy as np

nyt = read_csv("nytimes.csv")
nyt = nyt.drop(['class.labels'], axis=1)

# document frequency: in how many documents each word appears
freq = nyt.astype(bool).sum(axis=0)
freq = freq.replace(0, 1)                 # avoid division by zero
w = np.log(float(nyt.shape[0]) / freq)    # IDF weights
nyt = nyt.apply(lambda x: x * w, axis=1)  # weight each word's counts

# normalize each row (document) to unit Euclidean length
nyt = nyt.apply(lambda x: x / np.sqrt(np.sum(np.square(x)) + 1e-16), axis=1)
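The snippet above stops at the weighting; to get all the way to Shalizi-style scores you still need to de-mean and take the SVD as in the question. An end-to-end sketch, using a tiny synthetic word-count matrix as a stand-in for nytimes.csv (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic word-count matrix: rows are documents, columns are word counts,
# mirroring the layout assumed for nytimes.csv in the post.
counts = pd.DataFrame({"art":   [3, 0, 1, 0],
                       "paint": [2, 1, 0, 0],
                       "music": [0, 4, 0, 2],
                       "song":  [0, 2, 1, 3]}, dtype=float)

# IDF weighting, as in the answer's code
freq = counts.astype(bool).sum(axis=0)
freq = freq.replace(0, 1)
w = np.log(float(counts.shape[0]) / freq)
weighted = counts * w                       # broadcasts over columns

# Normalize each row (document) to unit Euclidean length
weighted = weighted.div(np.sqrt((weighted ** 2).sum(axis=1) + 1e-16), axis=0)

# De-mean, then project onto the first two principal directions via SVD
demeaned = weighted - weighted.mean(axis=0)
u, s, v = np.linalg.svd(demeaned.T, full_matrices=False)
scores = demeaned.values @ u[:, :2]         # one 2-D point per document

assert scores.shape == (4, 2)
```

These `scores` are what gets plotted against the labels, just as in the question's plotting code.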