What is the conjugate prior distribution of the Dirichlet distribution?
pr.probability, st.statistics
Related Solutions
There are many approaches to this problem. Here are three.
The subjective Bayes approach says the prior should simply quantify what is known or believed before the experiment takes place. Period. End of discussion.
The empirical Bayes approach says you can estimate your prior from the data itself. (In that case your "prior" isn't prior at all.)
The objective Bayes approach says to pick priors based on mathematical properties, such as "reference" priors that in some sense maximize information gain. Jim Berger gives a good defense of objective Bayes here.
In practice someone may use any and all of these approaches, even within the same model. For example, they may use a subjective prior on parameters where there is a considerable amount of prior knowledge and use a reference prior on other parameters that are less important or less understood.
Often it simply doesn't matter much what prior you use. For example, you might show that a variety of priors, say an optimistic prior and a pessimistic prior, lead to essentially the same conclusion. This is particularly the case when there's a lot of data: the impact of the prior fades as data accrue. But for other applications, such as hypothesis testing, priors matter more.
Let's say that you have a distribution $F$ in the exponential family with density \begin{align} \newcommand{\mbx}{\mathbf x} \newcommand{\btheta}{\boldsymbol{\theta}} f(\mbx \mid \btheta) &= \exp\bigl(\eta(\btheta) \cdot T(\mbx) - g(\btheta) + h(\mbx)\bigr) \end{align}
Given independent realizations $\{\mbx_1, \mbx_2, \dotsc, \mbx_n\}$ of $F$ (with unknown parameter $\btheta$), the conjugate prior of $F$ is the distribution $F'$ over $\btheta$ whose density has the same functional form in $\btheta$ as the likelihood: \begin{align} f(\btheta \mid \boldsymbol\phi) \propto L(\btheta \mid \mbx_1, \dotsc, \mbx_n) &= f(\mbx_1, \dotsc, \mbx_n \mid \btheta) \\ &= \prod_i f(\mbx_i\mid \btheta) \\ &= \textstyle\prod_i\exp\Bigl(\eta(\btheta) \cdot T\left(\mbx_i\right) - g(\btheta) + h(\mbx_i)\Bigr) \\ &\propto \textstyle\prod_i\exp\Bigl(\eta(\btheta) \cdot T\left(\mbx_i\right) - g(\btheta)\Bigr) \\ &= \exp\Bigl(\eta(\btheta) \cdot \bigl(\textstyle\sum_iT\left(\mbx_i\right)\bigr) - ng(\btheta)\Bigr) \\ &= \exp\bigl(\eta'(\boldsymbol \phi) \cdot T'(\btheta)\bigr), \end{align} where \begin{align} \eta'(\boldsymbol\phi) &= \begin{bmatrix} \sum_iT_1(\mbx_i) \\ \vdots \\ \sum_iT_k(\mbx_i) \\ \sum_i1 \end{bmatrix} & T'(\btheta) &= \begin{bmatrix} \eta_1(\btheta) \\ \vdots \\ \eta_k(\btheta) \\ -g(\btheta) \end{bmatrix}. \end{align} Thus $F'$ is also in the exponential family ($T'$ takes the place of $\eta$, and $\eta'$ takes the place of $T$, since this distribution is over $\btheta$, the parameter of the distribution over $\mbx$).
Interestingly, $\boldsymbol\phi$ has exactly one more component than $\btheta$, except in the rare case where the natural parameter $\phi_{k+1}$ is redundant; but such a distribution would be very weird, since it would mean that the number of observations $n$ tells you nothing about $\btheta$.
So, to answer your question, with each conjugate prior you get exactly one more hyperparameter.
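As a concrete instance of this rule, here is a minimal sketch (my own illustration, not part of the original answer) using the Bernoulli/Beta pair: the sufficient statistic is $T(x) = x$, so the conjugate prior's hyperparameters are the running sum of $T(\mbx_i)$ together with the observation count, and a $\mathrm{Beta}(a, b)$ prior corresponds to $\boldsymbol\phi = (a-1,\, a+b-2)$ in this parameterization. Updating the prior with data is then nothing more than adding the data's sufficient statistics to $\boldsymbol\phi$:

```python
# A minimal sketch (illustration only) of the general conjugate update:
# phi = (sum of T(x_i), number of observations), with T(x) = x for Bernoulli.

def conjugate_update(phi, observations, T=lambda x: x):
    """Add the observations' sufficient statistics to phi = (sum_T, count)."""
    sum_T, count = phi
    return (sum_T + sum(T(x) for x in observations),
            count + len(observations))

prior = (1.0, 3.0)                 # Beta(2, 3) in (sum_T, count) form
data = [1, 0, 1, 1, 0, 1]          # 4 successes in 6 Bernoulli trials
posterior = conjugate_update(prior, data)
print(posterior)                   # (5.0, 9.0), i.e. Beta(6, 5)
```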
There are many conjugate priors of the Gaussian distribution depending on how you look at it. In my opinion, the analogy to the Multinomial-Dirichlet example would set things up as follows: assume that $n$ real-valued numbers are generated by a Gaussian with unknown mean and variance. Then, the distribution of the mean and variance given the data points is a three-parameter conjugate prior distribution whose sufficient statistics are the total of the samples, the total of the squares of the samples, and the number of samples.
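To make that bookkeeping concrete, here is a minimal sketch of my own (not the answerer's code) that tracks exactly those three hyperparameters; it assumes, purely for illustration, an improper "empty" starting point so that the hyperparameters coincide with the accumulated statistics:

```python
# Track the three hyperparameters as accumulated sufficient statistics
# (sum of x, sum of x^2, n) for a Gaussian with unknown mean and variance.

def gaussian_conjugate_update(phi, xs):
    """phi = (sum_x, sum_x2, n); fold the new samples xs into it."""
    sum_x, sum_x2, n = phi
    return (sum_x + sum(xs),
            sum_x2 + sum(x * x for x in xs),
            n + len(xs))

phi = (0.0, 0.0, 0)                      # empty start: no pseudo-observations
phi = gaussian_conjugate_update(phi, [1.2, -0.4, 0.9, 2.1])
sum_x, sum_x2, n = phi
print(sum_x / n)                         # sample mean from the accumulated stats
print(sum_x2 / n - (sum_x / n) ** 2)     # (biased) sample variance, likewise
```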
Best Answer
Neil sent me an email asking:
===
I read your post at http://www.stat.columbia.edu/~cook/movabletype/archives/2009/04/conjugate_prior.html and I was wondering if you could expand on how to update the Dirichlet conjugate prior that you provided in your paper:
S. Lefkimmiatis, P. Maragos, and G. Papandreou, Bayesian Inference on Multiscale Models for Poisson Intensity Estimation: Applications to Photon-Limited Image Denoising, IEEE Transactions on Image Processing, vol. 18, no. 8, pp. 1724-1741, Aug. 2009
In other words, given, in your paper's notation, the prior hyper-parameters (vector $\mathbf{v}$ and scalar $\eta$) and $N$ Dirichlet observations (vectors $\boldsymbol{\theta}_n$, $n=1,\dots,N$), how do you update $\mathbf{v}$ and $\eta$?
===
Here is my response:
Conjugate pairs are so convenient because there is a standard and simple way to incorporate new data by just modifying the parameters of the prior density. One just multiplies the likelihood with its conjugate prior; the result has the same parametric form as the prior, and the new parameters can be readily "read off" by comparing the likelihood-prior product with the prior's parametric form. This is described in detail in all standard texts on Bayesian statistics, such as Gelman et al. (2003) or Bernardo and Smith (2000).
In the case of the Dirichlet and its conjugate prior described in our paper, and using its notation, after observing $N$ Dirichlet vectors $\boldsymbol{\theta}_n$, $n=1,\dots,N$, where each vector $\boldsymbol{\theta}_n$ is $D$-dimensional with elements $\theta_n[t]$, $t=1,\dots,D$, the $D+1$ hyper-parameters should be updated as follows:
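Following the general read-off rule (and assuming the prior is parameterized so that $\mathbf{v}$ multiplies the Dirichlet's sufficient statistics $\ln\theta[t]$ and $\eta$ counts pseudo-observations; if the paper defines $\mathbf{v}$ with the opposite sign, subtract instead of add), the update takes the form \begin{align} v[t] &\leftarrow v[t] + \sum_{n=1}^{N} \ln\theta_n[t], \quad t=1,\dots,D, & \eta &\leftarrow \eta + N. \end{align}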
You can verify this in a few lines of equations by following the previously described general rule.
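For concreteness, here is a minimal sketch of this update in Python (my own illustration with hypothetical names, not the paper's code, assuming the sign convention above):

```python
import math

# v accumulates the componentwise logs of the observed Dirichlet vectors;
# eta counts the observations.

def update_dirichlet_conjugate(v, eta, thetas):
    """v: list of D hyperparameters, eta: scalar, thetas: list of D-vectors
    (each a probability vector with positive entries summing to 1)."""
    for theta in thetas:
        v = [v_t + math.log(th_t) for v_t, th_t in zip(v, theta)]
    return v, eta + len(thetas)

# Two observed 3-dimensional Dirichlet vectors (hypothetical values):
thetas = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3]]
v, eta = update_dirichlet_conjugate([0.5, 0.5, 0.5], 1.0, thetas)
print(v, eta)   # v decreases componentwise (logs are negative); eta -> 3.0
```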
Hope this helps!