Solved – R: Visualizing document clustering results

clustering, data visualization, r

I have a k-means clustering result with 35 clusters; there are 5000 documents, each belonging to one of the 35 clusters. I would like to visualize the results of the clustering algorithm on a scatter plot (or something similar) where each document is colored by the cluster it belongs to, and the distance between documents on the plot reflects their similarity (i.e. the more similar two documents are, the closer together they appear). Ideally, it would also be nice to see the top 10 words of each cluster. I am attaching my code for the clustering algorithm; it works on data from a database.

library(tm)

myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
# convert to lower case (base functions need content_transformer in tm >= 0.6)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# add extra stop words
myStopwords <- c(stopwords('english'), "twitter", "tweets", "tweet", "tweeting", "account")
# remove stop words from the corpus
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myTdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths=c(1,Inf), weighting=weightTfIdf))
# remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse=0.90)

m2 <- as.matrix(myTdm2)
# cluster terms hierarchically (note: "ward" is called "ward.D" in recent R)
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D")
# transpose the matrix to cluster documents instead of terms
m3 <- t(m2)

# k-means clustering
k <- 35
kmeansResult <- kmeans(m3, k)
# cluster centers
round(kmeansResult$centers, digits = 3)
# print the top terms and the documents of every cluster
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep = ""))
  s <- sort(kmeansResult$centers[i, ], decreasing = TRUE)
  cat(names(s)[1:15], "\n")
  print(docs[which(kmeansResult$cluster == i)])
}
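
For the plot itself, I was imagining something along these lines (just a rough sketch using classical MDS via cmdscale; m3, k and kmeansResult come from the code above), but I'm not sure it is the right approach:

# project documents into 2-D with classical multidimensional scaling,
# so that distances on the plot approximate distances between documents
docDist <- dist(m3)
points2d <- cmdscale(docDist, k = 2)

# color each document by its cluster assignment
plot(points2d, col = rainbow(k)[kmeansResult$cluster], pch = 16,
     xlab = "MDS dimension 1", ylab = "MDS dimension 2")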

Best Answer

First, analyze how the clusters actually differ.

K-means is an odd algorithm: sometimes it works very well, and in other situations it fails very badly. In particular, it has a tendency to split your data set along a single axis.

So you may actually find out that your clustering result is something like this:

  • Cluster A contains all documents that contain "apple"
  • Cluster B contains all documents that contain "banana"
  • Cluster C contains all documents that contain "cocoa"
  • Cluster D contains all the others

You need to double check your clustering results!

There are several reasons for this. One is the way clusters look to k-means: they are Voronoi cells, separated by hyperplanes orthogonal to the lines connecting pairs of cluster centers. The other big effect here is the sparsity of your data set. The mean vectors computed by k-means are usually much less sparse than the documents themselves. In fact, the average distance between the mean vectors will likely be lower than the distance from your data objects to their closest mean.
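
You can check this effect on your own data by comparing how sparse the documents are with how sparse the centers are; a quick sketch, assuming m3 and kmeansResult from the question:

# fraction of exactly-zero entries in the document vectors
mean(m3 == 0)
# fraction of (near-)zero entries in the cluster centers; since centers
# are averages of documents, far fewer of their entries stay at zero
mean(abs(kmeansResult$centers) < 1e-12)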

That comparison makes a good test: what is the average distance between two cluster centers, and what is the average distance from an observation to its nearest cluster center? Clearly, objects should on average be closer to their own cluster center than two cluster centers are to each other. For sparse data, however, this may actually not hold.
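
This test is easy to run; a minimal sketch, again assuming m3 and kmeansResult from the question:

centers <- kmeansResult$centers

# average pairwise distance between the cluster centers
mean(dist(centers))

# average distance from each document to its assigned (nearest) center
docToCenter <- sqrt(rowSums((m3 - centers[kmeansResult$cluster, ])^2))
mean(docToCenter)

If the second number is not clearly smaller than the first, the "clusters" are probably not well separated.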