[Math] What can be said about an infinite linear chain of conjugate prior distributions

bayesian-probability, pr.probability, probability-distributions

We can sample a discrete value from the multinomial distribution.

We can also sample the parameters of the multinomial distribution from its conjugate prior, the Dirichlet distribution.
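A minimal sketch of this two-stage sampling in NumPy (the Dirichlet hyperparameters and the number of trials below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters of the Dirichlet conjugate prior (illustrative values).
alpha = np.array([2.0, 3.0, 5.0])

# First sample the multinomial parameters from the Dirichlet prior...
theta = rng.dirichlet(alpha)

# ...then sample a vector of counts from the multinomial with those parameters.
counts = rng.multinomial(10, theta)

print(theta, counts)
```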

Since the Dirichlet distribution is part of the exponential family, it too must have a conjugate prior distribution in the exponential family.
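For concreteness, the Dirichlet can indeed be written in exponential-family form, with natural parameters $\alpha_i - 1$ and sufficient statistics $\log\theta_i$: \begin{align} f(\boldsymbol\theta \mid \boldsymbol\alpha) = \frac{1}{B(\boldsymbol\alpha)}\prod_{i=1}^k \theta_i^{\alpha_i - 1} = \exp\Bigl(\textstyle\sum_{i=1}^k (\alpha_i - 1)\log\theta_i - \log B(\boldsymbol\alpha)\Bigr). \end{align}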

I hope you see where I'm going: what happens as this chain of priors is taken to infinity?

For a simpler example, what happens with the Gaussian distribution, which is self-conjugate (as a prior on the mean of another Gaussian with known variance)?

Best Answer

Let's say that you have a distribution $F$ in the exponential family with density \begin{align} \newcommand{\mbx}{\mathbf x} \newcommand{\btheta}{\boldsymbol{\theta}} f(\mbx \mid \btheta) &= \exp\bigl(\eta(\btheta) \cdot T(\mbx) - g(\btheta) + h(\mbx)\bigr) \end{align}

Given independent realizations $\{\mbx_1, \mbx_2, \dotsc, \mbx_n\}$ of $F$ (with unknown parameter $\btheta$), the likelihood, viewed as a function of $\btheta$ and normalized, defines a distribution $F'$ over $\btheta$: the conjugate prior of $F$. Its density satisfies \begin{align} f(\btheta \mid \boldsymbol\phi) \propto L(\btheta \mid \mbx_1, \dotsc, \mbx_n) &= f(\mbx_1, \dotsc, \mbx_n \mid \btheta) \\\\ &= \textstyle\prod_i f(\mbx_i\mid \btheta) \\\\ &= \textstyle\prod_i\exp\Bigl(\eta(\btheta) \cdot T\left(\mbx_i\right) - g(\btheta) + h(\mbx_i)\Bigr) \\\\ &\propto \textstyle\prod_i\exp\Bigl(\eta(\btheta) \cdot T\left(\mbx_i\right) - g(\btheta)\Bigr) \\\\ &= \textstyle\exp\Bigl(\eta(\btheta) \cdot \bigl(\sum_iT\left(\mbx_i\right)\bigr) - ng(\btheta)\Bigr) \\\\ &= \exp\bigl(\eta'(\boldsymbol \phi) \cdot T'(\btheta)\bigr) \end{align} where \begin{align} \eta'(\boldsymbol\phi) &= \begin{bmatrix} \sum_iT_1(\mbx_i) \\\\ \vdots \\\\ \sum_iT_k(\mbx_i) \\\\ \sum_i 1 \end{bmatrix} & T'(\btheta) &= \begin{bmatrix} \eta_1(\btheta) \\\\ \vdots \\\\ \eta_k(\btheta) \\\\ -g(\btheta) \end{bmatrix}. \end{align} Thus $F'$ is also in the exponential family. (Here $T'$ plays the role that $\eta$ played before, and $\eta'$ the role of $T$, since this distribution is over $\btheta$, the parameter of the distribution over $\mbx$.)

Interestingly, $\boldsymbol\phi$ has exactly one more parameter than $\btheta$, except in the rare case where the natural parameter $\phi_{k+1}$ is redundant; but such a distribution would be very strange, since it would mean that the number of observations $n$ tells you nothing about $\btheta$.
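As a quick sanity check of both the construction and the counting, one can run the textbook Bernoulli example through it: with $f(x \mid p) = p^x(1-p)^{1-x} = \exp\bigl(x\log\tfrac{p}{1-p} + \log(1-p)\bigr)$, we have $\eta(p) = \log\tfrac{p}{1-p}$, $T(x) = x$, and $g(p) = -\log(1-p)$. The recipe above then gives \begin{align} f(p \mid \boldsymbol\phi) \propto \exp\Bigl(\bigl(\textstyle\sum_i x_i\bigr)\log\tfrac{p}{1-p} + n\log(1-p)\Bigr) = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}, \end{align} which is the kernel of a Beta distribution: the one-parameter Bernoulli picks up the two-hyperparameter Beta as its conjugate prior, exactly as the counting argument predicts.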

So, to answer your question: each step along the chain of conjugate priors adds exactly one hyperparameter. Starting from a $k$-parameter family, the $m$-th prior in the chain therefore has $k + m$ parameters, so the chain grows without bound rather than stabilizing or cycling.

There are many conjugate priors of the Gaussian distribution, depending on how you set things up. In my opinion, the closest analogue of the multinomial-Dirichlet example is the following: assume that $n$ real numbers are generated by a Gaussian with unknown mean and variance. Then the distribution of the mean and variance given the data points is a three-parameter conjugate prior whose sufficient statistics are the sum of the samples, the sum of the squares of the samples, and the number of samples.
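A numerical sketch of that setup, assuming a normal-inverse-gamma prior and the standard conjugate update formulas (the hyperparameter values `mu0, lam, a, b` are illustrative, not from the answer above); note that the updates touch the data only through the three sufficient statistics just named:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)  # data from a Gaussian with "unknown" mean/variance

# The three sufficient statistics named above.
n, s1, s2 = len(x), x.sum(), (x ** 2).sum()

# Normal-inverse-gamma prior hyperparameters (illustrative values).
mu0, lam, a, b = 0.0, 1.0, 1.0, 1.0

# Standard conjugate updates; they depend on the data only through (n, s1, s2).
xbar = s1 / n
mu_n = (lam * mu0 + s1) / (lam + n)
lam_n = lam + n
a_n = a + n / 2
b_n = b + 0.5 * (s2 - n * xbar ** 2) + lam * n * (xbar - mu0) ** 2 / (2 * (lam + n))

print(mu_n, lam_n, a_n, b_n)
```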
