What are good ranges for the hyperparameters $\alpha$ and $\beta$ (explained well here) in LDA?

I appreciate hyperparameter tuning always depends on the use case, data, content of documents etc., but is there any general rule or heuristic to choose these hyperparameters for LDA?

Additional Info

For extra info on my particular use case and data (although I'd like a generalizable answer if possible):

  1. 29 documents with an average length of 5,177 words (after parsing). The number of documents is expected to grow to between 50 and 200.

  2. 3,500 unique words (after parsing and keeping the top 3,500 words by frequency)

  3. 155,309 total words (again, after parsing)

  4. All documents are finance-related, and more specifically investment outlook whitepapers, so there isn't a lot of "variety" between documents.

This is quite a small dataset, but I think there are enough words and structure in each document to train an LDA model (if not, please let me know).

The choice of $\alpha$ and $\beta$ is indeed tricky, since it affects the topic modeling results. The Gibbs sampling paper by Griffiths and Steyvers gives some insight into this:

"The value of $\beta$ thus affects the granularity of the model: a corpus of documents can be sensibly factorized into a set of topics at several different scales, and the particular scale assessed by the model will be set by $\beta$. With scientific documents, a large value of $\beta$ would lead the model to find a relatively small number of topics, perhaps at the level of scientific disciplines, whereas smaller values of $\beta$ will produce more topics that address specific areas of research."

For scientific documents, the authors eventually settled on the hyperparameters $\beta=0.1$ and $\alpha=50/T$, where $T$ is the number of topics. Keep in mind, though, that they had a corpus of around 28K documents and a vocabulary of 20K words, and that they tried several different values of $T$: $[50, 100, 200, 300, 400, 500, 600, 1000]$.
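To make the $\alpha=50/T$ rule concrete, here is a minimal sketch that evaluates it for the topic counts listed above. The function name `symmetric_alpha` and the `total_mass` parameter are my own labels for illustration, not something taken from the paper.

```python
# Topic counts tried in the paper, as listed above.
topic_counts = [50, 100, 200, 300, 400, 500, 600, 1000]


def symmetric_alpha(num_topics, total_mass=50.0):
    """Per-topic alpha under the alpha = 50 / T heuristic (hypothetical helper)."""
    return total_mass / num_topics


alphas = {T: symmetric_alpha(T) for T in topic_counts}
for T, alpha in alphas.items():
    print(f"T={T:5d}  alpha={alpha:.3f}")
```

The rule only yields $\alpha \le 1$ once $T \ge 50$. With a much smaller number of topics, say $T=10$, the same formula gives $\alpha=5$, so it should not be copied blindly to a small corpus.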

Regarding your data: I have no experience with analyzing financial text, but for the choice of $\alpha$ and $\beta$ I would ask myself the following questions:

  • Given my word vocabulary, do I expect the resulting topics to be sparse, that is, does each topic concentrate on a small subset of the vocabulary? In most cases this is true, so the topic-word prior is typically chosen to be sparse, with $\beta < 1$.
  • Given the topics, do I expect the distribution of topics in each document to be sparse, that is, does each document cover only a few topics? If yes, choose $\alpha < 1$.
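To see what "sparse" means for a Dirichlet prior, the following sketch draws samples from a symmetric Dirichlet at several concentration values and reports how much probability mass the three largest components hold. It assumes a symmetric prior over 20 components (a number picked only for illustration), and the helper names `sample_dirichlet` and `mean_top_share` are hypothetical.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """One draw from a symmetric Dirichlet, via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_top_share(concentration, dim=20, top_k=3, n_samples=500, seed=0):
    """Average mass held by the top_k largest components of each draw."""
    rng = random.Random(seed)
    shares = []
    for _ in range(n_samples):
        probs = sorted(sample_dirichlet(concentration, dim, rng), reverse=True)
        shares.append(sum(probs[:top_k]))
    return sum(shares) / n_samples


for c in [0.05, 0.1, 0.5, 1, 5, 10]:
    print(f"concentration={c:<5} top-3 share of 20: {mean_top_share(c):.2f}")
```

Concentrations well below 1 put most of the mass on a handful of components, while values well above 1 spread it almost evenly. The same intuition applies to $\alpha$ (topics per document) and $\beta$ (words per topic).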

Answering these questions may not be straightforward with limited knowledge of the data. Since you have limited data, I would try multiple values of $\alpha$ and $\beta$, ranging from sparse to non-sparse priors, and find which pair suits the dataset by computing the perplexity on held-out data. To put it more concretely:

  • Choose $\alpha_m$ from $[0.05, 0.1, 0.5, 1, 5, 10]$
  • Choose $\beta_m$ from $[0.05, 0.1, 0.5, 1, 5, 10]$
  • Run topic modeling on the training data with each $(\alpha_m, \beta_m)$ pair
  • Compute the model perplexity on the held-out test data
  • Choose the $(\alpha_m, \beta_m)$ pair with the minimum perplexity
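The steps above can be sketched in plain Python as follows. This is only an illustration under several assumptions of mine: a small collapsed Gibbs sampler (`fit_lda_gibbs`) stands in for whatever LDA implementation you actually use, held-out perplexity is approximated by document completion (topic proportions are inferred from every other word of a test document and the remaining words are scored), and the corpus is synthetic. All function names and the toy data are hypothetical.

```python
import math
import random
from itertools import product


def fit_lda_gibbs(docs, vocab_size, n_topics, alpha, beta, n_iter, rng):
    """Collapsed Gibbs sampling for LDA; returns topic-word probabilities."""
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]

    def update(d, w, z, delta):
        topic_word[z][w] += delta
        topic_total[z] += delta
        doc_topic[d][z] += delta

    # Random initial topic for every token, then fill the count tables.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assignments[d]):
            update(d, w, z, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # take the token out
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, w, z, +1)

    return [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(n_topics)
    ]


def heldout_perplexity(docs, phi, alpha, n_steps=20):
    """Document completion: infer theta from half a doc, score the rest."""
    n_topics = len(phi)
    log_lik, n_tokens = 0.0, 0
    for doc in docs:
        observed, hidden = doc[::2], doc[1::2]
        theta = [1.0 / n_topics] * n_topics
        for _ in range(n_steps):  # smoothed EM-style updates, phi fixed
            counts = [alpha] * n_topics
            for w in observed:
                resp = [theta[k] * phi[k][w] for k in range(n_topics)]
                total = sum(resp)
                for k in range(n_topics):
                    counts[k] += resp[k] / total
            theta = [c / sum(counts) for c in counts]
        for w in hidden:
            p = sum(theta[k] * phi[k][w] for k in range(n_topics))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


def make_toy_corpus(n_docs, doc_len, rng, vocab_size=30, n_topics=3):
    """Synthetic docs: each draws ~90% of its words from one vocabulary block."""
    block = vocab_size // n_topics
    return [
        [rng.randrange(k * block, (k + 1) * block) if rng.random() < 0.9
         else rng.randrange(vocab_size)
         for _ in range(doc_len)]
        for k in (d % n_topics for d in range(n_docs))
    ]


def grid_search(train, test, vocab_size, n_topics, grid, n_iter=15, seed=0):
    """Fit one model per (alpha, beta) pair; return results sorted by perplexity."""
    results = []
    for alpha, beta in product(grid, grid):
        rng = random.Random(seed)
        phi = fit_lda_gibbs(train, vocab_size, n_topics, alpha, beta, n_iter, rng)
        results.append((heldout_perplexity(test, phi, alpha), alpha, beta))
    return sorted(results)


rng = random.Random(0)
corpus = make_toy_corpus(n_docs=18, doc_len=40, rng=rng)
train, test = corpus[:12], corpus[12:]
grid = [0.05, 0.1, 0.5, 1, 5, 10]
ranked = grid_search(train, test, vocab_size=30, n_topics=3, grid=grid)
best_perplexity, best_alpha, best_beta = ranked[0]
print(f"best alpha={best_alpha}, beta={best_beta}, perplexity={best_perplexity:.2f}")
```

On real data you would replace the toy corpus with your tokenized documents mapped to integer word ids, and most likely swap the sampler for a library implementation. With only a few dozen documents, a single train/test split is noisy, so averaging the perplexity over several splits is probably worthwhile.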


