Blei has shown that it is possible to estimate $\alpha$ in an LDA model, but I have yet to find a library (any library: C, C++, Java, …) that does so. Implementations (including Blei's) usually treat $\alpha$ as a hyperparameter with uniform values, and instead estimate $\theta$, the $D \times K$ matrix whose rows are the per-document topic distributions. However, I am more interested in the original LDA model, where $\alpha$ is the parameter of the Dirichlet distribution over topic distributions, and I am currently stuck in the abyss of mathematical equations in Blei's paper. How should I go about solving this problem?
Solved – How to estimate the $\alpha$ parameter of a latent Dirichlet allocation model
dirichlet-distribution, estimation, topic-models
Best Answer
$\ \newcommand{\Mult}{\mathrm{Mult}} \newcommand{\Dir}{\mathrm{Dir}} \newcommand{\R}{\mathbb{R}} $ Let's briefly review the (unsmoothed) model of Latent Dirichlet Allocation, as presented in [Blei 2003].
For a fixed vocabulary of size $V$, with $K$ topics and $M$ documents, each containing $N$ words,
$$ \begin{align} \alpha &\in \R^K_+ & \text{Concentration Parameter} \\ \forall\, k \quad \beta_k &\in \R^{V} & \text{Topic Probability Vectors} \\ \forall\, d \quad \theta_d \mid \alpha &\sim \Dir(\alpha) & \text{Per-doc Topic Mixing Proportions} \\ \forall\, d,w \quad z_{dw} \mid \theta_d &\sim \Mult(\theta_d) & \text{Per-word Topic Indicators} \\ \forall\, d,w \quad w_{dw} \mid z_{dw}, \beta &\sim \Mult(\beta_{z_{dw}}) & \text{Words (Observed)} \end{align} $$
Alternatively, $\beta=(\beta_1,\dots,\beta_K)$ and $\Theta=(\theta_1,\dots,\theta_M)$ can be represented as $K\times V$ and $M\times K$ probability matrices, respectively, whose rows sum to one. $\mathbf{Z}=(z_{dj})_{d,j}$ and $\mathbf{W}=(w_{dj})_{d,j}$ can also be represented as matrices.
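As a minimal sketch (the dimensions and seed below are arbitrary, not from the paper), the generative process above can be simulated directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: V vocabulary words, K topics, M docs, N words each.
V, K, M, N = 50, 3, 10, 20

alpha = np.full(K, 0.1)                   # concentration parameter (uniform here)
beta = rng.dirichlet(np.ones(V), size=K)  # K x V topic-word probability matrix

docs = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)        # per-doc topic mixing proportions
    z = rng.choice(K, size=N, p=theta_d)  # per-word topic indicators
    # each observed word is drawn from its topic's word distribution
    words = np.array([rng.choice(V, p=beta[k]) for k in z])
    docs.append(words)
```

In practice only `docs` (the matrix $\mathbf{W}$) is observed; everything else is latent.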
We model the documents themselves as mixtures of multinomial distributions. The mixing proportions $\theta_d$ for each document $d$ are conditionally independent of the mixing proportions for all other documents, given the concentration parameter $\alpha$ (see the plate diagram in [Blei 2003]).
Typically, one observes the words $\mathbf{W}$ and wants to infer the values of the unknown variables $\Theta$, $\mathbf{Z}$, and $\beta$, and optionally the hyperparameter $\alpha$. In the paper, inference is presented as a variational expectation-maximization algorithm, which divides each iteration into two steps (cf. Section 5.3), repeated until convergence:

1. **(E-step)** With $\alpha$ and $\beta$ fixed, fit a variational approximation to the posterior over the local hidden variables $\Theta$ and $\mathbf{Z}$.
2. **(M-step)** With the variational posteriors fixed, update $\beta$ and $\alpha$ to maximize the resulting lower bound on the log-likelihood.
The key here is that estimation of $\alpha$ is done as though we had complete information about the states of the local hidden variables. The hyperparameter $\alpha$ is the concentration parameter of a Dirichlet distribution, from which the multinomial probability vectors $\theta_d$ are "generated".
Crucially, the $\theta_d$ are all conditionally independent given $\alpha$, and $\alpha$ is conditionally independent of all other variables in the model given $\Theta$ (check the plate diagram!). Thus, estimating $\alpha$ reduces to the problem of estimating the parameter of a Dirichlet distribution given a set of multinomial observations. The following papers derive algorithms to this end:

- [Minka] Thomas P. Minka, "Estimating a Dirichlet distribution" (2000).
- [Huang] Jonathan Huang, "Maximum Likelihood Estimation of Dirichlet Distribution Parameters".
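To make the reduction concrete, here is a minimal sketch of the Minka-style fixed-point iteration for the Dirichlet MLE, assuming the $\theta_d$ were observed directly (function names are my own, not from any library; the inverse-digamma initialization follows Minka's approximation, refined by Newton steps):

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=5):
    # Minka's initialization for psi^{-1}(y), then Newton refinement.
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(iters):
        x = x - (digamma(x) - y) / polygamma(1, x)  # psi'(x) = polygamma(1, x)
    return x

def dirichlet_mle(theta, iters=1000, tol=1e-7):
    """Fixed-point MLE of alpha, given rows of theta (probability vectors)."""
    # Sufficient statistics: empirical mean of log(theta_dk) over documents d.
    log_p_bar = np.log(theta).mean(axis=0)
    alpha = np.ones(theta.shape[1])  # initial guess
    for _ in range(iters):
        # Fixed point: psi(alpha_k_new) = psi(sum_j alpha_j) + log_p_bar_k
        alpha_new = inv_digamma(digamma(alpha.sum()) + log_p_bar)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha
```

A quick sanity check is to draw many $\theta_d$ from a Dirichlet with known $\alpha$ and verify that `dirichlet_mle` approximately recovers it. In the full LDA algorithm, the exact $\log \theta_{dk}$ statistics are replaced by their variational expectations.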
For an example implementation of the $\alpha$-estimation procedure, you should check out gensim, which directly implements the Newton-Raphson update described in [Huang].
I would definitely recommend reading up on expectation maximization and variational inference if you aren't already familiar with them; intuition about these techniques is key to understanding what's happening in Blei's paper.