Solved – How to optimize the number of topics using R Mallet

machine-learning, r, topic-models

I would like to select the optimal number of LDA topics using R's mallet package. I know that there are several ways to do this with other LDA implementations in R, especially topicmodels, which lets you minimize perplexity, maximize the harmonic mean of the log likelihoods, or maximize the log likelihood itself (though apparently this last method is not quite as good, according to the author of the post I linked to). However, I'm not sure which method works best with mallet.
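With topicmodels, for instance, comparing a grid of candidate values of k looks roughly like this, if I read that package's interface correctly (dtm here stands for a hypothetical DocumentTermMatrix built from my corpus):

library(topicmodels)

# dtm: a hypothetical DocumentTermMatrix of the corpus
candidate.k <- c(20, 50, 100)

# One Gibbs-sampled model per candidate k; keep = 50 stores the chain's
# log likelihood every 50th iteration (useful for the harmonic-mean method)
fits <- lapply(candidate.k, function(k)
  LDA(dtm, k = k, method = "Gibbs",
      control = list(burnin = 50, iter = 200, keep = 50)))

# Compare the fitted models on log likelihood (higher is better); the
# per-iteration values sit in fit@logLiks if the harmonic mean is wanted
sapply(fits, logLik)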

I further understand that mallet protects against setting the number of topics too high by using asymmetric priors, which would seem to give some "wiggle room" in selecting an ideal number of topics. However, I'm not sure what a reasonable number of topics is (I've experimented with everything between 20 and 200 topics over a corpus of some 140 documents), so this isn't much help.
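If I understand the hyperparameter optimization correctly, one way to see that wiggle room might be to deliberately overshoot the number of topics and then check how many tokens each topic actually attracts, along these lines (only a sketch; mallet.instances is built with mallet.import() as in my code below):

library(mallet)

# Deliberately overshoot the number of topics, with hyperparameter
# optimization switched on
topic.model <- MalletLDA(num.topics = 100)
topic.model$loadDocuments(mallet.instances)
topic.model$setAlphaOptimization(20, 50)   # optimize every 20 iterations after a 50-iteration burn-in
topic.model$train(200)

# Raw token counts per topic: topics that attract almost no tokens are ones
# the optimized asymmetric prior has effectively switched off
doc.topics <- mallet.doc.topics(topic.model, smoothed = FALSE, normalized = FALSE)
sort(colSums(doc.topics))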

How can I select the optimal number of topics?

Edit: I've tried the following code (taken from 2). However, it either gives what I believe is an rJava error (which can be fixed, at least temporarily, by reinstalling with install.packages("rJava")), or it returns a list of NULLs with length equal to the length of the sequence. I'm not sure whether there's a good way around this, or whether another route would be better.

library(mallet)

# Sampling settings
burnin <- 50    # iterations before hyperparameter optimization kicks in
iter   <- 200   # Gibbs sampling iterations
keep   <- 50    # defined but not used below (keep is a topicmodels Gibbs control, not a mallet one)
opt    <- 20    # optimize hyperparameters every 20 iterations

# Plain-text stopword list, one word per line
stopwords <- "~/path/to/stopwords.txt"
mallet.instances <- mallet.import(dataframe$id, dataframe$text, stopwords)

# Fit one model per candidate number of topics
best.model <- lapply(seq(2, 5, by = 1), function(k){
  topic.model <- MalletLDA(num.topics = k)
  topic.model$loadDocuments(mallet.instances)
  topic.model$setAlphaOptimization(opt, burnin)
  topic.model$train(iter)   # called for its side effect; its return value is NULL
  })
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics = c(2:5),
                                   LL = as.numeric(as.matrix(best.model.logLik)))

library(ggplot2)
ggplot(best.model.logLik.df, aes(x=topics, y=LL)) + 
  xlab("Number of topics") + ylab("Log likelihood of the model") + 
  geom_line() + 
  theme_bw() 

Best Answer

The best "algorithm" for selecting the number of topics is human judgement. Usually you want a level of granularity that suits your purpose: not so many topics that you are wading through superfluous detail, and not so few that the topics are too coarse-grained to be useful.

Thus, experiment with the number of topics until you are satisfied with the result, and test the robustness of your topics (e.g., by varying the random seed).
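If you want a rough quantitative signal to support that judgement, you can adapt the loop from your edit so that each iteration returns the trained model rather than the NULL that train() produces, and then read the log likelihood off the underlying MALLET object. This is only a sketch: it assumes the R wrapper exposes modelLogLikelihood() from MALLET's ParallelTopicModel (which the wrapped Java model extends), and it reuses mallet.instances, burnin, iter and opt from your code.

candidate.k <- seq(2, 5, by = 1)

# One trained model per candidate k
models <- lapply(candidate.k, function(k) {
  topic.model <- MalletLDA(num.topics = k)
  topic.model$loadDocuments(mallet.instances)
  topic.model$setAlphaOptimization(opt, burnin)
  topic.model$train(iter)
  topic.model   # return the model itself, not the NULL that train() returns
})

# Log likelihood straight from the Java object; modelLogLikelihood() is a
# ParallelTopicModel method, which I am assuming the R wrapper exposes
ll <- sapply(models, function(m) m$modelLogLikelihood())
data.frame(topics = candidate.k, LL = ll)

For the robustness check, refit at a fixed number of topics with different random seeds and compare the top words per topic; the setRandomSeed() call below is again an assumption about what the wrapper passes through to MALLET.

# Refit at a fixed k under two different seeds
fit.once <- function(k, seed) {
  topic.model <- MalletLDA(num.topics = k)
  topic.model$loadDocuments(mallet.instances)
  topic.model$setRandomSeed(as.integer(seed))   # assumed to be passed through to MALLET
  topic.model$train(iter)
  topic.model
}
m1 <- fit.once(20, 1)
m2 <- fit.once(20, 2)

# Top 10 words for every topic under each seed; with a robust solution the
# same themes should reappear, possibly in a different order
lapply(1:20, function(t) mallet.top.words(m1, mallet.topic.words(m1)[t, ], 10))
lapply(1:20, function(t) mallet.top.words(m2, mallet.topic.words(m2)[t, ], 10))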