Solved – Seeded LDA using topicmodels in R

latent-dirichlet-alloctopic-models

I would like to perform a LDA in R by means of topicmodels. To make the resulting topics better match the topics I want to see, I would like to input "seed words" for these topics into the LDA, comparable to what the Python guidedlda package does (https://medium.freecodecamp.org/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164).

In the documentation of the topicmodels package, the author writes:

When Gibbs sampling is used for fitting the model, seed words with
their additional weights for the prior parameters can be specified in
order to be able to fit seeded topic models.

And, above:

Optional arguments. For method = "Gibbs" an additional argument
seedwords can be specified as a matrix or an object of class
"simple_triplet_matrix"; the default is NULL.

My questions:

  • Do I understand correctly that topicmodels allows me to do a seeded LDA comparable to the guidedlda package in Python?
  • How exactly does it work, how does the code look like here? Do I simply input another argument "seedwords" into LDA() with a matrix containing two columns (word and weight)?

Best Answer

An author of the topicmodels package was so friendly to provide a sample code which explains how to do it:

library("topicmodels")
data("AssociatedPress", package = "topicmodels")

## We fit 6 topics.
## We specify five seed words for five topics, the sixth topic has no
## seed words.
library("slam")
set.seed(123)
i <- rep(1:5, each = 5)
j <- sample(1:ncol(AssociatedPress), 25)
SeedWeight <- 500 - 0.1
deltaS <- simple_triplet_matrix(i, j, v = rep(SeedWeight, 25),
                                nrow = 6, ncol = ncol(AssociatedPress))
set.seed(1000)
ldaS <- LDA(AssociatedPress, k = 6, method = "Gibbs", seedwords = deltaS, 
            control = list(alpha = 0.1, best = TRUE,
                           verbose = 500, burnin = 500, iter = 100, thin = 100, prefix = character()))

apply(deltaS, 1, function(x) which(x == SeedWeight))
apply(posterior(ldaS)$terms, 1, function(x) order(x, decreasing = TRUE)[1:5])

EDIT: The variable j determines the seed words by column number of the simple tripled matrix. In order to feed in your own words, you could do the following:

data <- tidy_text %>%
  cast_dtm(paragraph, word, n) # Document term matrix

i <- rep(1:5, each = 2) # Number of topics, number of "seeded" words per topic

seedWords <- c("service", "health",
               "sustainability", "energy", 
               "dialogue", "stakeholder", 
               "car", "mobility", 
               "test", "drive")

# Find the position of your seed words in your DTM
j <- NULL
for (z in seq_along(seedWords)) { # This loop surely can be made more elegant
  j[z] <- which(data$dimnames$Terms %in% seedWords[z]) 
# the final triplet matrix is not allowed to have NAs; thus, make sure 
# first that your seed words actually appear in the DTM
}