Solved – How to estimate the $\alpha$ parameter of a latent Dirichlet allocation model

dirichlet-distribution, estimation, topic-models

Blei has shown that it is possible to estimate $\alpha$ in an LDA model, but I have yet to find a library (in any language; C, C++, Java, …) that does so. Implementations (including Blei's) usually treat $\alpha$ as a hyperparameter with uniform values, and instead estimate $\theta$, the $D \times K$ matrix whose rows are the documents' topic distributions. However, I am more interested in the original LDA model, where $\alpha$ is the parameter of the Dirichlet distribution over topic distributions, and I am currently stuck in the abyss of mathematical equations in Blei's paper. How should I go about solving this problem?

Best Answer

$\ \newcommand{\Mult}{\mathrm{Mult}} \newcommand{\Dir}{\mathrm{Dir}} \newcommand{\R}{\mathbb{R}} $ Let's briefly review the (unsmoothed) model of Latent Dirichlet Allocation, as presented in [Blei 2003].

[Blei 2003] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993–1022.

For a fixed vocabulary of size $V$, $K$ topics, and $M$ documents, each containing $N$ words,

$$ \begin{align} \alpha &\in \R^K_+ & \text{Concentration Parameter} \\ \forall\, k \quad \beta_k &\in \R^{V} & \text{Topic Probability Vectors} \\ \forall\, d \quad \theta_d | \alpha &\sim \Dir(\alpha) & \text{Per-doc Topic Mixing Proportions} \\ \forall\, d,w \quad z_{dw} | \theta_d &\sim \Mult(\theta_d) & \text{Per-word Topic Indicators} \\ \forall\, d,w \quad w_{dw} | z_{dw}, \beta &\sim \Mult(\beta_{z_{dw}}) & \text{Words (Observed)} \end{align} $$

Alternatively, $\beta=(\beta_1,\dots,\beta_K)$ and $\Theta=(\theta_1,\dots,\theta_M)$ can be represented as $K\times V$ and $M\times K$ probability matrices, respectively, where rows sum to one. $\mathbf{Z}=(z_{dj})_{d,j}$ and $\mathbf{W}=(w_{dj})_{d,j}$ can also be represented as matrices.
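To make the generative process concrete, here is a minimal NumPy sketch of sampling from the model above; the dimensions and the value of $\alpha$ are arbitrary illustrative choices, and $\beta$ is drawn from a symmetric Dirichlet purely to obtain valid probability rows (in the model it is a fixed parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N = 1000, 10, 50, 100             # vocabulary size, topics, docs, words per doc

alpha = np.full(K, 0.1)                    # concentration parameter (illustrative value)
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic matrix; each row beta_k sums to 1

docs = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)                  # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)          # z_dw | theta_d ~ Mult(theta_d)
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])  # w_dw | z_dw ~ Mult(beta_{z_dw})
    docs.append(w_d)
```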

We model the documents themselves as mixtures of multinomial distributions. The mixing proportions $\theta_d$ for each document $d$ are conditionally independent of the mixing proportions for all other documents, given the concentration parameters $\alpha$. From [Blei 2003]:

The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized as a distribution over words.

Typically, one observes the words $\mathbf{W}$ and is interested in inferring the values of the unknown variables $\Theta$, $\mathbf{Z}$, and $\beta$, and optionally the hyperparameter $\alpha$. In the paper, the inference procedure is presented as a Variational Expectation-Maximization algorithm, which essentially divides inference into two steps (cf. Section 5.3), performed iteratively until convergence:

  1. (E-Step) Variational inference to (essentially) estimate the local hidden variables $\Theta$ and $\mathbf{Z}$ given estimates of $\alpha$ and $\beta$. This is done by optimizing over auxiliary variational parameters, whose expectation yields estimates of each hidden variable.
  2. (M-Step) Maximum likelihood estimation of the global hidden variables $\alpha$ and $\beta$, given estimates of $\Theta$ and $\mathbf{Z}$ (a toy sketch of this loop follows the list).
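To show the structure of the loop, below is a toy NumPy/SciPy sketch of this variational EM, assuming documents are given as integer arrays of word ids; the `alpha_update` hook and the epsilon smoothing of $\beta$ are my own simplifications, and real implementations are considerably more careful.

```python
import numpy as np
from scipy.special import digamma

def variational_em(docs, V, K, alpha_update=None, em_iters=50, e_iters=20, seed=0):
    """Toy variational EM for (unsmoothed) LDA. `docs` is a list of 1-D integer
    arrays of word ids in [0, V). A sketch, not an optimized implementation."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(K)
    beta = rng.dirichlet(np.ones(V), size=K)           # K x V, rows sum to 1
    for _ in range(em_iters):
        beta_acc = np.zeros((K, V))
        exp_log_theta = np.zeros((len(docs), K))
        for d, w in enumerate(docs):
            gamma = alpha + len(w) / K                 # variational Dirichlet params
            for _ in range(e_iters):                   # E-step for document d
                phi = beta[:, w].T * np.exp(digamma(gamma))   # N x K, unnormalized
                phi /= phi.sum(axis=1, keepdims=True)
                gamma = alpha + phi.sum(axis=0)
            np.add.at(beta_acc.T, w, phi)              # accumulate expected counts
            exp_log_theta[d] = digamma(gamma) - digamma(gamma.sum())  # E[log theta_d]
        # M-step for beta (epsilon avoids exact zeros in the unsmoothed update):
        beta = (beta_acc + 1e-12) / (beta_acc + 1e-12).sum(axis=1, keepdims=True)
        if alpha_update is not None:                   # optional M-step for alpha,
            alpha = alpha_update(exp_log_theta)        # e.g. the routine sketched below
    return alpha, beta
```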

The key here is that estimation of $\alpha$ is done as though we had complete information about the states of the local hidden variables. The hyperparameter $\alpha$ is the concentration parameter of a Dirichlet distribution, from which the multinomial probability vectors $\theta_d$ are "generated".

Crucially, the $\theta_d$ are all conditionally independent given $\alpha$, and $\alpha$ is conditionally independent of all other variables in the model given $\Theta$ (check the plate diagram!). Thus, estimation of $\alpha$ reduces to the problem of estimating the parameters of a Dirichlet distribution from a set of multinomial observations. The following papers derive algorithms to this end:

[Minka 2000] Minka, Thomas P. "Estimating a Dirichlet Distribution." Technical report, 2000.

[Huang] Huang, Jonathan. "Maximum Likelihood Estimation of Dirichlet Distribution Parameters."

For an example implementation of the $\alpha$-estimation procedure, you should check out gensim, which directly implements the Newton-Raphson update described in [Huang].
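To make the reduction concrete, here is a minimal sketch (not gensim's actual code) of a Newton-Raphson update for the Dirichlet MLE, exploiting the diagonal-plus-rank-one structure of the Hessian described in [Minka 2000] and [Huang] so that each step costs $O(K)$; the unit initialization and the step-halving safeguard are my own simple choices:

```python
import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_mle_newton(theta, tol=1e-7, max_iter=1000):
    """Newton-Raphson MLE of a Dirichlet concentration vector alpha, given an
    (M, K) matrix theta whose rows are observed probability vectors."""
    M, K = theta.shape
    log_p_bar = np.log(theta).mean(axis=0)        # (1/M) sum_d log theta_dk
    alpha = np.ones(K)                            # simple initial guess
    for _ in range(max_iter):
        alpha0 = alpha.sum()
        g = M * (digamma(alpha0) - digamma(alpha) + log_p_bar)   # gradient
        q = -M * polygamma(1, alpha)              # diagonal part of the Hessian
        z = M * polygamma(1, alpha0)              # rank-one coefficient
        b = (g / q).sum() / (1.0 / z + (1.0 / q).sum())
        step = (g - b) / q                        # H^{-1} g without forming H
        while np.any(alpha - step <= 0):          # keep alpha strictly positive
            step /= 2.0
        alpha_new = alpha - step
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha

# Sanity check: recover a known alpha from synthetic Dirichlet draws.
rng = np.random.default_rng(0)
true_alpha = np.array([0.5, 1.0, 2.0])
theta = rng.dirichlet(true_alpha, size=5000)
print(dirichlet_mle_newton(theta))                # should be close to true_alpha
```

Note that inside variational EM one does not observe $\theta_d$ directly; there, `log_p_bar` would be replaced by the average of the variational expectations $\psi(\gamma_{dk}) - \psi(\sum_j \gamma_{dj})$, as accumulated in the sketch above.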

I would definitely recommend reading up about Expectation Maximization and Variational Inference if you aren't already familiar; intuition about these techniques is key to understanding what's happening in Blei's paper.
