Bayesian – Why Are Prior Distributions Sometimes Conditional Probabilities?

bayesian, conditional-probability, hierarchical-bayesian, phylogeny

I came across the following Bayesian equation in a textbook of evolutionary biology:

$f(t, r, \theta|X) \propto f(X|t, r, \theta)f(t|\theta)f(r|t,\theta)f(\theta)$

$f(X|t, r, \theta)$ is the likelihood, $f(t|\theta)$ is the prior distribution on times $t$, $f(r|t,\theta)$ is the prior distribution on rates $r$, and $f(\theta)$ is the prior on the substitution-model parameters $\theta$.

I'm having trouble understanding what it means for a prior distribution to be conditioned on other values, as we have in $f(t|\theta)$ and $f(r|t,\theta)$. Does this mean that our prior beliefs about $t$ depend on $\theta$ and those about $r$ depend on both $t$ and $\theta$? I.e., that without knowing $t$ and $\theta$, we have no prior beliefs about $r$?

Does the specification of the priors in this way make this an example of a hierarchical model? If so, what exactly is the hierarchy?

If anyone can recommend any literature that might help me understand this, I'd be grateful for it.

Best Answer

It is often easier to reason about one random quantity at a time than work with all random quantities simultaneously. In Bayesian statistics, where everything is a random quantity, this is especially true. You often have to fix one random quantity to work with another. In more technical terms, it is often easier to work with conditional distributions than joint distributions.

You can think of conditional probability as a tool to set a random quantity to a particular, fixed value. So rather than thinking about $f(r | t, \theta)$ as prior knowledge of $r$ depending on $t$ and $\theta$, imagine this conditional distribution as a way to reason about $r$ without the interference of random fluctuations in $t$ and $\theta$. You set $t$ and $\theta$ to specific values $t^*$ and $\theta^*$ while allowing $r$ to vary: $f(r | t = t^*, \theta = \theta^*)$.
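For instance, here is a minimal sketch of drawing $r$ with $t$ and $\theta$ pinned at $t^*$ and $\theta^*$. The Normal form of the conditional prior, its parameters, and the helper `sample_r_given` are assumptions made purely for illustration, not the textbook's actual prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional prior (an assumption for illustration, not the
# phylogenetic model's prior): r | t, theta ~ Normal(theta / t, 0.1).
def sample_r_given(t_star, theta_star, size=5):
    """Draw r with t and theta held at the fixed values t*, theta*."""
    return rng.normal(loc=theta_star / t_star, scale=0.1, size=size)

# Conditioning simply pins t and theta in place; only r varies across draws.
print(sample_r_given(t_star=2.0, theta_star=1.0))
```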

As @Peter Pang notes, you can factor the joint prior distribution differently than in your original post:

$$ \begin{aligned} p(t, r, \theta) &= p(r \vert \theta, t)\,p(\theta \vert t)\,p(t)\\ &= p(\theta \vert r, t)\,p(r \vert t)\,p(t)\\ &= p(\theta \vert r, t)\,p(t \vert r)\,p(r)\\ &\ \ \vdots \end{aligned} $$

Depending on the specific problem you're working on, it may be simpler (conceptually, mathematically, or numerically) to work with, say, $f(t | r, \theta)$ instead of $f(r | t, \theta)$. Since the joint prior can be factored differently, it is always your option to choose which quantities, if any, are fixed in place (i.e. conditioned on) at each step in the factorization.
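If it helps to see the hierarchy concretely, here is a minimal sketch of ancestral sampling from the joint prior under the factorization in your post, $f(\theta)\,f(t|\theta)\,f(r|t,\theta)$: draw $\theta$ first, then $t$ given $\theta$, then $r$ given both. The gamma/exponential/lognormal choices and the function name `sample_joint_prior` are placeholders I've assumed, not the textbook's priors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ancestral sampling from the joint prior using one factorization:
#   p(t, r, theta) = p(theta) * p(t | theta) * p(r | t, theta).
# The specific distributions are placeholders chosen for illustration only.
def sample_joint_prior(n=1000):
    theta = rng.gamma(shape=2.0, scale=1.0, size=n)              # p(theta)
    t = rng.exponential(scale=theta)                              # p(t | theta)
    r = rng.lognormal(mean=np.log(1.0 / t), sigma=0.1 * theta)    # p(r | t, theta)
    return t, r, theta

t, r, theta = sample_joint_prior()
print(t[:3], r[:3], theta[:3])
```

Each line conditions only on quantities already drawn higher up, which is exactly the hierarchy your priors encode.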
