Solved – Maximum Likelihood Estimation of Dirichlet Mean

dirichlet distribution, maximum likelihood

Consider the problem of computing a maximum-likelihood estimate of the parameters of a finite Dirichlet distribution, given a set of observations (probability vectors) assumed to have been sampled from a Dirichlet. The following paper provides an iterative fixed-point algorithm that estimates the mean and precision of the Dirichlet separately:

Minka, Thomas P. "Estimating a Dirichlet distribution." Technical report, MIT, 2000.

The algorithm for estimating the mean $\mathbf{m}$, given fixed precision $s$, is summarized as follows:

  1. Estimate the full concentration parameters $\pmb{\alpha}=s\mathbf{m}$ by inverting the digamma function.
  2. $\forall\, k$, set $m_k^{new} = \frac{\alpha_k}{\sum_j \alpha_j}$.
  3. Repeat until convergence (a code sketch of this iteration follows below).
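For reference, here is a minimal NumPy/SciPy sketch of how I read this iteration. The particular update (a Lagrange-multiplier correction estimated from the previous iterate, followed by a Newton-based inversion of the digamma function) is my interpretation of the paper rather than code taken from it, and the function names are mine:

```python
import numpy as np
from scipy.special import digamma, polygamma


def inv_digamma(y, n_newton=5):
    """Invert the digamma function psi elementwise via Newton's method."""
    # Two-regime starting point often attributed to the paper's appendix
    # (an assumption on my part).
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_newton):
        x = x - (digamma(x) - y) / polygamma(1, x)  # polygamma(1, .) = trigamma
    return x


def fit_dirichlet_mean(P, s, max_iter=1000, tol=1e-10):
    """Fixed-point estimate of the Dirichlet mean m, with the precision s held fixed.

    P is an (N, K) array whose rows are probability vectors; entries must be
    strictly positive because the sufficient statistic is an average of logs.
    """
    log_p_bar = np.log(P).mean(axis=0)  # (1/N) * sum_i log p_ik
    m = P.mean(axis=0)                  # naive average as the starting point
    for _ in range(max_iter):
        # Lagrange-multiplier term for the constraint sum_k m_k = 1,
        # evaluated at the previous iterate (my reading of the update).
        lam = np.sum(m * (log_p_bar - digamma(s * m)))
        alpha = inv_digamma(log_p_bar - lam)  # step 1: alpha = s*m via psi^{-1}
        m_new = alpha / alpha.sum()           # step 2: renormalize
        if np.max(np.abs(m_new - m)) < tol:   # step 3: stop at convergence
            return m_new
        m = m_new
    return m
```

Starting from the naive average $\frac{1}{N}\sum_i \mathbf p_i$ seems like a natural initialization, since it already sits on the simplex.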

Why must we resort to an iterative algorithm to find the mean? Isn't the average of our observed data vectors an accurate estimate of the mean? Further, isn't the expected value of the sample mean of a set of draws from a Dirichlet equal to the mean of the Dirichlet itself?

Any insight here would be appreciated!

Best Answer

Suppose $\mathbf p_1, \ldots, \mathbf p_N$ are iid $\operatorname{Dirichlet}(s \mathbf m)$. If I'm understanding you correctly, your question is "why use an iterative scheme when $\hat {\mathbf m} = \frac 1 N \sum_{i = 1} ^ N \mathbf p_i$ works?" You are correct that this is a reasonable estimator. But it isn't the maximum likelihood estimator, which is what we care about! The Dirichlet likelihood of a single observation is $$ L_i(\pmb \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k\Gamma(\alpha_k)} \prod_k p_{ik}^{\alpha_k - 1}, $$ so our goal is to maximize $\prod_i L_i (\pmb \alpha)$ in $\pmb \alpha$; once we do this, we can get the maximum likelihood estimate of $\mathbf m$ by normalizing. But it is easy to see that the likelihood depends on the data only through $\frac 1 N \sum_i \log \mathbf p_i$ rather than $\frac 1 N \sum_i \mathbf p_i$ (I'm using $\log$ elementwise here). In some sense, we might think of $\log \mathbf p_i$ as the "appropriate scale" of the data - at least, for the Dirichlet distribution - rather than the untransformed $\mathbf p_i$.
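To see this concretely, take logs and average over the sample: $$ \frac 1 N \sum_{i=1}^N \log L_i(\pmb \alpha) = \log \Gamma\Big(\sum_k \alpha_k\Big) - \sum_k \log \Gamma(\alpha_k) + \sum_k (\alpha_k - 1)\, \overline{\log p}_k, \qquad \overline{\log p}_k = \frac 1 N \sum_{i=1}^N \log p_{ik}, $$ so the data enter the likelihood only through the average log-observations $\overline{\log p}_k$, and any maximizer can only be a function of them.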

So, the MLE is not $\frac 1 N \sum_i \mathbf p_i$ but rather some complicated function of $\frac 1 N \sum_i \log \mathbf p_i$. The question now becomes "why use the MLE rather than the easy estimator?" Well, we have theorems which say the MLE is asymptotically efficient - under regularity conditions it attains the Cramér-Rao lower bound asymptotically. So we get a more efficient estimator with the MLE, although $\frac 1 N \sum_i \mathbf p_i$ may still be useful as a starting point for the iterative algorithm. Now, I'm not sure how good the MLE really is here, considering that the data must actually be Dirichlet distributed for it to work, whereas $\frac 1 N \sum_i \mathbf p_i$ is consistent no matter what. But that is another story.
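If you want to see how much efficiency is actually at stake, a quick simulation along these lines would show it. (I'm assuming the hypothetical fit_dirichlet_mean sketch from your question is in scope, and the true mean, precision, and sample size below are just illustrative values.)

```python
import numpy as np

rng = np.random.default_rng(0)
m_true = np.array([0.5, 0.3, 0.2])   # illustrative true mean
s, N = 10.0, 50                      # illustrative precision and sample size

naive_se, mle_se = [], []
for _ in range(1000):
    P = rng.dirichlet(s * m_true, size=N)    # one simulated Dirichlet data set
    naive_se.append(np.sum((P.mean(axis=0) - m_true) ** 2))
    # fit_dirichlet_mean: the fixed-point sketch from the question (assumed in scope)
    mle_se.append(np.sum((fit_dirichlet_mean(P, s) - m_true) ** 2))

print("mean squared error, naive average:", np.mean(naive_se))
print("mean squared error, MLE:          ", np.mean(mle_se))
```

I would expect the MLE's error to come out somewhat smaller when the Dirichlet model is correct, and the gap to shrink or even reverse under misspecification, which is exactly the trade-off above.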
