How high is "high" and how low is "low" in Latent Dirichlet Allocation alpha and eta hyperparameters?

hyperparameter, topic-models

In relation to this question and answer, the default value in the Python LDA implementation is 0.1 for alpha and 0.01 for eta. Are these the normal values? If so, how low and how high can alpha and eta go? For alpha, say, is there a dramatic change in results when transitioning from 0.1 to 0.2?

Given these frequencies, where the first column is the number of words per document and the second column is the number of documents with that word count, can you give estimated values for alpha and eta?

  • 19 – 1
  • 18 – 3
  • 17 – 9
  • 16 – 14
  • 15 – 24
  • 14 – 109
  • 13 – 288
  • 12 – 880
  • 11 – 1056
  • 10 – 1582
  • 9 – 2681
  • 8 – 3687
  • 7 – 3668
  • 6 – 4797
  • 5 – 4566
  • 4 – 6028

Best Answer

To your first question: the LDA implementations in R's topicmodels and Python's scikit-learn both set default values inversely proportional to the number of topics, $\frac{50}{k}$ and $\frac{1}{k}$ respectively. That's not a comprehensive survey by any means, but it suggests it's common to choose these parameters based on the number of topics rather than as an arbitrary constant.
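
For example, scikit-learn's LatentDirichletAllocation exposes these priors as doc_topic_prior (alpha) and topic_word_prior (eta), both defaulting to $\frac{1}{k}$ when left unset; a minimal sketch:

# Minimal sketch of how scikit-learn exposes the two priors.
# doc_topic_prior plays the role of alpha, topic_word_prior the role of eta;
# both default to 1 / n_components when left as None.
from sklearn.decomposition import LatentDirichletAllocation

k = 10  # number of topics
lda_default = LatentDirichletAllocation(n_components=k)  # priors at default
# Equivalent to setting them explicitly:
lda_explicit = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=1.0 / k,   # alpha
    topic_word_prior=1.0 / k,  # eta
)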

To the question of how high or low these values can go: both hyperparameters are parameters of Dirichlet distributions, so you can choose any value permissible as a Dirichlet parameter. Any positive real will do, and it can go as high as you like.

To the question of how much change to expect from these hyperparameters, first a bit on their role in LDA: both control sparsity, at different levels of the model.

  • Recall that in LDA, for every document the topic proportions $\theta$ are drawn from $Dir(\boldsymbol{\alpha})$, with $\boldsymbol{\alpha}$ a $k$-length vector in which every $\alpha_i = \alpha$ and $k$ the number of topics. Do most documents have a few topics, or many? This is what your choice of $\alpha$ should capture: your sense of how topically sparse each document is.
  • In smoothed LDA, each topic $\beta_i$ is drawn from $Dir(\boldsymbol{\eta})$, and the words assigned to that topic are drawn from a categorical distribution with parameter $\beta_i$. That is, $\eta$ expresses how word-sparse we expect topics to be. (A sketch of this generative process follows the list.)
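
To make the two levels concrete, here is a minimal numpy sketch of smoothed LDA's generative process; the topic count, vocabulary size, and document length are illustrative assumptions, not values from the question:

import numpy as np

rng = np.random.default_rng(0)
k, V = 5, 50            # number of topics, vocabulary size (illustrative)
alpha, eta = 0.1, 0.01  # the two hyperparameters discussed above

# Topic level: each of the k topics is a distribution over V words,
# drawn from Dir(eta). Small eta -> each topic concentrates on few words.
beta = rng.dirichlet(np.full(V, eta), size=k)

# Document level: the topic proportions theta are drawn from Dir(alpha).
# Small alpha -> each document concentrates on few topics.
theta = rng.dirichlet(np.full(k, alpha))

# Generate one short document: pick a topic per word, then a word from it.
n_words = 10
topics = rng.choice(k, size=n_words, p=theta)
words = [rng.choice(V, p=beta[z]) for z in topics]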

For intuition, it helps to look at draws from Dirichlet distributions with different parameter choices. In the image below, each subplot shows one draw for a different parameter value. The parameter grows as you read right and down, starting at $\frac{1}{20}$ and doubling with each subplot. (The second and third have $\alpha = 0.1$ and $0.2$ respectively, corresponding to the values in your question.)

[Figure: draws from various Dirichlet distributions, one subplot per parameter value]

You can see from this that smaller parameter values produce sparser draws. ("Sparse" in that more entries of the topic proportion vector are close to zero.) As the parameter grows, draws from the distribution become more uniform, in that each entry of the vector drawn from $Dir(\boldsymbol\alpha)$ varies less and less around $\frac{1}{k}$. (You can also see this from the variance expression for the Dirichlet distribution on its Wikipedia page: the variance of each $X_i$, with $X \sim Dir(\boldsymbol\alpha)$, decreases as $\alpha$ grows.)
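
As a quick numerical check of that last point, one can compare the empirical variance of draws against the closed-form variance, which for a symmetric Dirichlet with $k$ components and common parameter $\alpha$ reduces to $Var(X_i) = \frac{k - 1}{k^2 (k\alpha + 1)}$; a small sketch:

import numpy as np

rng = np.random.default_rng(0)
k = 10  # dimension of the Dirichlet (number of topics)

for alpha in [0.05, 0.1, 0.2, 1.0, 10.0]:
    draws = rng.dirichlet(np.full(k, alpha), size=100000)
    empirical = draws[:, 0].var()
    # Closed-form variance of each component of a symmetric Dirichlet:
    # Var(X_i) = (k - 1) / (k^2 * (k * alpha + 1))
    analytic = (k - 1) / (k ** 2 * (k * alpha + 1))
    print(f"alpha={alpha:5.2f}  empirical={empirical:.5f}  analytic={analytic:.5f}")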

To your last question, I don't see a way to recommend values from that information alone. (Especially without knowing more about the application.)

Code to reproduce plot

# -*- coding: utf-8 -*-
"""
Created on Mon May 30 13:19:12 2016

@author: SeanEaster
"""

import matplotlib.pyplot as plt
import numpy as np

# Draw one 10-dimensional Dirichlet sample per subplot, with the symmetric
# parameter alpha = (1 / 10) * 2 ** (i - 1) for i in 0..8, i.e. starting at
# 1/20 and doubling with each subplot.
randomDirichletSamples = [
    np.random.dirichlet(np.ones(10) / 10. * np.power(2., i - 1))
    for i in range(9)
]

# Use a common y-axis limit across all subplots for comparability.
maxValue = np.array(randomDirichletSamples).max() * 1.1

# Plot each random sample as a bar chart in a 3 x 3 grid.
for i, sample in enumerate(randomDirichletSamples):
    plt.subplot(3, 3, i + 1)
    plt.bar(np.arange(10), sample)
    plt.ylim((0, maxValue))

plt.show()