Machine Learning – Topic Stability in Topic Models

dirichlet-process, machine-learning, model-selection, small-sample, topic-models

I am working on a project where I want to extract some information about the content of a series of open-ended essays. In this particular project, 148 people wrote essays about a hypothetical student organization as part of a larger experiment. Although in my field (social psychology), the typical way to analyze these data would be to code the essays by hand, I'd like to do this quantitatively, since hand-coding is both labor-intensive and a bit too subjective for my taste.

While investigating ways to quantitatively analyze free-response data, I stumbled upon an approach called topic modeling (Latent Dirichlet Allocation, or LDA). Topic modeling takes a bag-of-words representation of your data (a term-document matrix) and uses information about word co-occurrences to extract the latent topics of the data. This approach seems perfect for my application.

Unfortunately, when I've applied topic modeling to my data, I've discovered two issues:

  1. The topics uncovered by topic modeling are sometimes hard to interpret
  2. When I re-run my topic models with a different random seed, the topics seem to change dramatically

Issue 2 in particular concerns me. Therefore, I have two related questions:

  1. Is there anything I can do in the LDA procedure to optimize my model fit for interpretability and stability? (A sketch of the estimation settings I know of follows this list.) Personally, I don't care as much about finding the model with the lowest perplexity and/or best model fit; I mainly want to use this procedure to help me understand and characterize what the participants in this study wrote in their essays. However, I certainly do not want my results to be an artifact of the random seed!
  2. Related to the above question, are there any standards for how much data you need to do an LDA? Most of the papers I've seen that have used this method analyze large corpora (e.g., an archive of all Science papers from the past 20 years), but, since I'm using experimental data, my corpus of documents is much smaller.
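
Regarding question 1, the main levers seem to be the estimation controls that the topicmodels package exposes. Below is only a sketch of the two settings that appear most relevant (the particular values are arbitrary, and I am not sure either of them fully solves the problem): multiple random restarts under VEM, and collapsed Gibbs sampling with a burn-in period and a longer chain.

# Sketch only: estimation controls that at least make the randomness explicit.
# With nstart > 1, topicmodels expects the seed to be a vector of length nstart.
lda_vem <- LDA(dtm, k = 5, method = "VEM",
               control = list(nstart = 5, seed = 1:5, best = TRUE))

# Collapsed Gibbs sampling with a burn-in period and a longer chain
lda_gibbs <- LDA(dtm, k = 5, method = "Gibbs",
                 control = list(seed = 42, burnin = 1000, iter = 2000, thin = 100))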

I have posted the essay data here for anyone who wants to get his or her hands dirty, and I have pasted the R code I'm using below.

require(tm)
require(topicmodels)

# Create a corpus from the essays
corpus <- Corpus(DataframeSource(essays))
inspect(corpus)

# Remove punctuation and put the words in lower case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))

# Create a DocumentTermMatrix.  The stopwords are the LIWC function word categories
# I have a copy of the LIWC dictionary, but if you want to do a similar analysis,
# use the default stop words in tm
dtm <- DocumentTermMatrix(corpus, control = list(stopwords = 
  c(dict$funct, dict$pronoun, dict$ppron, dict$i, dict$we, dict$you, dict$shehe, 
    dict$they, dict$inpers, dict$article, dict$aux)))

# Term frequency inverse-document frequency to select the desired words
term_tfidf <- tapply(dtm$v / rowSums(as.matrix(dtm))[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm) / colSums(as.matrix(dtm)))
summary(term_tfidf)

dtm <- dtm[, term_tfidf >= 0.04]

lda <- LDA(dtm, k = 5, seed = 532)
perplexity(lda)
(terms <- terms(lda, 10))
(topics <- topics(lda))
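
One practical note on the tf-idf filtering step: with a corpus this small, dropping low tf-idf terms can leave some essays with no remaining terms at all, and LDA() will not accept a document-term matrix that contains all-zero rows. A quick check (just a sketch, reusing the dtm object from above):

# Some essays may have lost every term in the tf-idf cut.
# LDA() needs every row of the DTM to contain at least one non-zero entry.
dim(dtm)                               # how many terms survived the cut?
empty <- rowSums(as.matrix(dtm)) == 0  # essays with no remaining terms
sum(empty)
dtm <- dtm[!empty, ]                   # drop them before fitting the model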

Edit:

I tried modifying nstart as suggested by Flounderer in the comments. Unfortunately, as shown below, even setting nstart to 1000 results in topics that vary quite dramatically from random seed to random seed. Just to emphasize again, the only thing I'm changing in the estimation of the two models below is the random seed used to start model estimation, and yet the topics do not seem to be at all consistent in these two runs.

lda <- LDA(dtm, k = 5, seed = 535, control = list(nstart = 1000))
(terms <- terms(lda, 10))

      Topic 1         Topic 2      Topic 3      Topic 4       Topic 5      
 [1,] "international" "ethnicity"  "free"       "credit"      "kind"       
 [2,] "communicate"   "true"       "team"       "mandatory"   "bridge"     
 [3,] "gain"          "asians"     "cooperate"  "music"       "close"      
 [4,] "use"           "hand"       "order"      "seen"        "deal"       
 [5,] "big"           "hold"       "play"       "barrier"     "designed"   
 [6,] "communication" "effective"  "big"        "stereotypes" "effort"     
 [7,] "america"       "emphasis"   "beginning"  "asians"      "implemented"
 [8,] "chinese"       "halls"      "china"      "fantastic"   "websites"   
 [9,] "ethnicity"     "minorities" "difference" "focusing"    "planned"    
[10,] "networks"      "population" "easier"     "force"       "body"

lda <- LDA(dtm, k = 5, seed = 536, control = list(nstart = 1000))
(terms <- terms(lda, 10))

      Topic 1       Topic 2         Topic 3        Topic 4       Topic 5    
 [1,] "kind"        "international" "issue"        "willing"     "play"     
 [2,] "easier"      "ethnicity"     "close"        "use"         "trying"   
 [3,] "gain"        "communication" "currently"    "hand"        "unity"    
 [4,] "websites"    "communicate"   "implemented"  "networks"    "decision" 
 [5,] "credit"      "bridge"        "particularly" "stereotypes" "gap"      
 [6,] "effort"      "america"       "credit"       "communicate" "normally" 
 [7,] "barriers"    "connection"    "fulfill"      "came"        "asians"   
 [8,] "effects"     "kind"          "grew"         "asians"      "created"  
 [9,] "established" "order"         "perspectives" "big"         "effective"
[10,] "strangers"   "skills"        "big"          "budget"      "prejudice"
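
To put a number on that impression rather than eyeballing the top-ten term lists, one option is to compare the topic-term distributions of the two fits directly. The snippet below is only a sketch: lda1 and lda2 stand for the two models fitted above (the code above overwrites lda), and the greedy matching is crude.

# Match each topic from run 1 to its most similar topic in run 2,
# using cosine similarity of the posterior topic-term distributions.
beta1 <- posterior(lda1)$terms   # k x V matrix, one row per topic
beta2 <- posterior(lda2)$terms

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim <- outer(seq_len(nrow(beta1)), seq_len(nrow(beta2)),
             Vectorize(function(i, j) cosine(beta1[i, ], beta2[j, ])))

# Greedy matching: the best counterpart in run 2 for each topic in run 1.
# Similarities near 1 would indicate stable topics; low values indicate drift.
data.frame(topic_run1      = seq_len(nrow(sim)),
           best_match_run2 = apply(sim, 1, which.max),
           similarity      = round(apply(sim, 1, max), 3))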

Best Answer

Out of curiosity, I applied a clustering algorithm that I've been working on to this dataset.

I've temporarily put up the results here (choose the essays dataset).

It seems like the problem is not the starting points or the algorithm, but the data. You can 'reasonably' (subjectively, in my limited experience) get good clusters even with 147 instances, as long as the data contain some hidden topics/concepts/themes/clusters (whatever you would like to call them).

If the data do not have well-separated topics, then no matter which algorithm you use, you might not get good answers.
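
One quick way to check this on the LDA fit itself is to look at how sharply each essay is assigned to a topic. This is just a sketch using the lda object from the question; if most essays spread their posterior mass roughly evenly over the five topics (values near 0.2), the model has not found well-separated topics in these data.

# Document-topic posteriors: one row per essay, one column per topic.
gamma <- posterior(lda)$topics

# With well-separated topics, most essays would concentrate their
# probability mass on a single dominant topic.
summary(apply(gamma, 1, max))
hist(apply(gamma, 1, max), breaks = 20,
     main = "Strength of each essay's dominant topic",
     xlab = "Max posterior topic probability")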