Solved – Implementing Latent Dirichlet Allocation – notation confusion

dirichlet distribution, gibbs, topic-models

I am trying to implement LDA using the collapsed Gibbs sampler from
http://www.uoguelph.ca/~wdarling/research/papers/TM.pdf

The main algorithm is shown below:

[Image: pseudocode of the collapsed Gibbs sampler from the linked paper]

I'm a bit confused about the notation in the innermost loop. n_dk is the number of words in document d that are assigned to topic k, but I'm not sure which document d this refers to. Is it the document that contains the current word (from the next outer loop)? Also, the paper does not say how to obtain the hyperparameters alpha and beta. Should these be guessed and then tuned? Finally, I don't understand what W refers to in the innermost loop, or what the beta without a subscript means.

Could anyone enlighten me?

Best Answer

I would suggest you look at page 8 of "Probabilistic Topic Models" by Mark Steyvers and Tom Griffiths. I found their explanation of the Gibbs algorithm quite clear and easy to implement.

To answer your questions:

  • i seems to range over (the indices of) all the words in all the documents, and d does seem to refer to the document that contains the word under consideration.
  • There are algorithms for estimating alpha and beta, but I haven't really understood any of them myself. Tuning the parameters manually should be fine in most cases.
  • W seems to be the size of the vocabulary, i.e. the number of unique words.
  • I'm confused about the beta with and without a subscript as well. If they use a different beta per word, then perhaps the denominator should be the sum of all beta_w instead of beta * W? With a symmetric prior (the same beta for every word) the two are equal.
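
To make the reading of the indices above concrete, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not the paper's code: the function name `collapsed_gibbs_lda`, the default values for `alpha` and `beta`, and the toy corpus are all assumptions for illustration. It also assumes symmetric priors (one scalar alpha and one scalar beta), which is why the denominator uses beta * W.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                        num_iters=50, seed=0):
    """docs: list of documents, each a list of word ids in range(vocab_size)."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # words in doc d assigned to topic k
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    n_k = [0] * num_topics                                # total words assigned to topic k
    z = []                                                # topic assignment of every token

    # Random initial topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):   # d is the document holding the current token
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k_old = z[d][i]
                n_dk[d][k_old] -= 1
                n_kw[k_old][w] -= 1
                n_k[k_old] -= 1
                # Unnormalised conditional probability of each topic for this token.
                # vocab_size plays the role of W; symmetric beta assumed.
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + beta)
                    / (n_k[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment and restore the counts.
                z[d][i] = k_new
                n_dk[d][k_new] += 1
                n_kw[k_new][w] += 1
                n_k[k_new] += 1
    return z, n_dk, n_kw, n_k

# Toy corpus (hypothetical): three documents over a vocabulary of 5 word ids.
docs = [[0, 1, 2, 0], [2, 3, 3, 4], [0, 4, 1]]
z, n_dk, n_kw, n_k = collapsed_gibbs_lda(docs, num_topics=2, vocab_size=5)
```

Note that d is never chosen separately: it is simply the document the current token belongs to. If you wanted a different beta per word, you would replace `beta` with `beta[w]` in the numerator and `beta * vocab_size` with `sum(beta)` in the denominator.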