Bayesian Statistics – Evaluating the Beta Distribution as a Conjugate Prior for the Binomial Distribution

bayesian, beta-distribution, binomial-distribution, conjugate-prior

I understand that the Beta distribution is the 'natural conjugate' prior of the Binomial distribution, in the sense that the posterior distribution is proportional to the product of the likelihood and the prior:

$$ Posterior(\theta | X) \propto Likelihood(X|\theta) \cdot Prior(\theta) $$

$$ \pi(\theta | X) \propto P(X=x | \theta) \cdot \pi(\theta) $$

$$ \pi(\theta | X) \propto \big[ \binom{n}{x} \theta^x (1-\theta)^{n-x} \big] \cdot \big[ \theta^{\alpha-1} (1-\theta)^{\beta-1} \big] $$

$$ \theta | X \sim \text{Beta}(x + \alpha, n - x + \beta) $$

with $x$ the number of successes in $n$ independent Bernoulli trials with success probability $\theta$, and $\alpha$ and $\beta$ the parameters of the Beta prior.

But I have also seen some people scaling up/down the impact of each unit of information into the posterior distribution:

$$ \theta | X \sim \text{Beta}(c \cdot x + \alpha, c \cdot (n - x) + \beta) $$

In that sense, the posterior distribution can adjust the strength of the evidence provided by the data by scaling the successes and failures: a large $c$ increases the effect of the data, a small $c$ decreases it.

The fact that there is an (arbitrary?) $c$ scaling the posterior distribution up and down makes me think that the Beta distribution, even though convenient because of the conjugacy, may not be a good representation of the distribution of $\theta$; otherwise, why tune it? Is there something I am missing or misinterpreting?

Just for a visual representation:
Let's define $\alpha = 1$, $\beta = 1$, $n = 16$, $x = 6$.

If $c = 0.5$:

(plot: posterior density with $c = 0.5$)

If $c = 1$:

(plot: posterior density with $c = 1$)

If $c = 3$:

(plot: posterior density with $c = 3$)
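The three cases above can be reproduced with a short pure-Python sketch (using the same $\alpha = \beta = 1$, $n = 16$, $x = 6$ as the setup): larger $c$ concentrates the posterior while barely moving its mean.

```python
# Posterior Beta(c*x + alpha, c*(n - x) + beta) for the three scaling factors.
# Setup matches the question: alpha = beta = 1, n = 16 trials, x = 6 successes.
import math

alpha0, beta0, n, x = 1, 1, 16, 6

for c in (0.5, 1, 3):
    a = c * x + alpha0            # scaled successes plus prior alpha
    b = c * (n - x) + beta0       # scaled failures plus prior beta
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"c = {c}: posterior mean = {mean:.3f}, sd = {sd:.3f}")
```

The means stay close to $x/n$, but the standard deviation shrinks roughly like $1/\sqrt{c}$, which is exactly the "strength of evidence" effect described above.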

Best Answer

The fact that there is an (arbitrary?) $c$ scaling up and down the posterior distribution makes me think that the Beta distribution ... may not be a good representation of the distribution of $\theta$, otherwise, why tuning it?

Why not tune it?

It is intrinsic to the Bayesian approach to be subjective. There is no unique choice of prior. Computing the expression $P(\theta|x)$ requires a prior $P(\theta)$, and that prior always encodes information from outside the experiment's observations, whether it stems from a tunable distribution family or from a fixed distribution.

Even when some standard gives you a unique prior that cannot be tuned, it is still a subjective choice to use that standard.

If anything, tuning makes the prior more practical (and better). For example:

  • Starting with the Jeffreys prior, $\propto \theta^{-1/2} (1-\theta)^{-1/2}$, after an experiment you obtain a posterior $\propto \theta^{x-1/2} (1-\theta)^{n-x-1/2}$, and that posterior can serve as the prior for a new experiment. You don't have to use the Jeffreys prior all the time.

  • In the analysis of the Covid vaccine, a conservative prior was chosen (one expressing prior belief opposite to what the experiment aims to demonstrate), as described in the question Which statistical model is being used in the Pfizer study design for vaccine efficacy?
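The sequential-updating point in the first bullet can be checked numerically: updating the Jeffreys prior Beta(1/2, 1/2) with two batches of data in turn gives the same Beta posterior as a single update with the pooled data. This is a sketch; the batch sizes are illustrative, not from the post.

```python
# Sequential vs pooled conjugate updates starting from the Jeffreys prior.
def update(a, b, x, n):
    """Conjugate Beta update after x successes in n trials."""
    return a + x, b + (n - x)

a, b = 0.5, 0.5                        # Jeffreys prior Beta(1/2, 1/2)
a, b = update(a, b, x=6, n=16)         # first experiment
a, b = update(a, b, x=10, n=20)        # posterior becomes the new prior

pooled = update(0.5, 0.5, x=16, n=36)  # one update with all the data
print((a, b), pooled)                  # same Beta parameters either way
```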

But I have also seen some people scaling up/down the impact of each unit of information into the posterior distribution:

$$ \theta | X \sim \text{Beta}(c \cdot x + \alpha, c \cdot (n - x) + \beta) $$

This isn't exactly tuning of the prior; it is more like an extension of the Bayesian analysis as a whole (diverging from it, so it isn't the same anymore). The parameter $c$ changes the likelihood function, not the prior. This additional tuning parameter $c$ is a trick outside of the Bayesian framework.

What distribution/likelihood is this exactly? It needs to be of a form that can be factored as $$f(x|\theta) = g(x) h(x|\theta)$$ where $$h(x|\theta) = \theta^{cx}(1-\theta)^{c(n-x)}$$

We might see it as an exponential dispersion family. For the binomial distribution, this has been described in another question: What is the dispersion parameter of binomial distribution?

If we use that likelihood function, then the dispersed binomial distribution takes the form (writing $p$ for the success probability and re-using $\theta$ for the natural parameter)

$$f(x|\theta,c) = h(x,c) \exp\left(\frac{\theta x - A(\theta)}{1/c} \right)$$

with

$$\begin{aligned} h(x,c) &= \binom{n}{x} \\ \theta &= \log(p/(1-p)) \\ A(\theta) &= n \log(1+\exp(\theta)) \end{aligned}$$
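As a sanity check (a sketch with $c = 1$ and made-up values of $n$ and $p$), the representation $h(x)\exp(\theta x - A(\theta))$ with $\theta = \log(p/(1-p))$ and $A(\theta) = n\log(1+e^\theta)$ does recover the ordinary binomial pmf:

```python
# Verify that the exponential-family form with c = 1 equals the binomial pmf
# C(n, x) * p^x * (1-p)^(n-x) for every x.
import math

n, p = 16, 0.37                          # illustrative values
theta = math.log(p / (1 - p))            # natural parameter
A = n * math.log(1 + math.exp(theta))    # log-partition function

for x in range(n + 1):
    pmf_direct = math.comb(n, x) * p**x * (1 - p)**(n - x)
    pmf_family = math.comb(n, x) * math.exp(theta * x - A)
    assert math.isclose(pmf_direct, pmf_family)
print("exponential-family form matches the binomial pmf")
```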

This is not a true distribution (and we cannot fix that by normalizing it, since normalization would change $A(\theta)$, and with it the likelihood).

The likelihood function with that parameter $c$ is a quasi-likelihood function. The parameter $c$ can be seen as a pragmatic way of tuning the dispersion of the binomial distribution.
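To see concretely why the quasi-likelihood still yields the Beta posterior from the question: raising the likelihood kernel to the power $c$ and combining it with a uniform Beta(1, 1) prior gives something proportional to Beta$(cx + 1, c(n-x) + 1)$. A small sketch (the grid and constants are illustrative) checks that the two log densities differ only by a constant:

```python
# The power likelihood theta^(c*x) * (1-theta)^(c*(n-x)) times a uniform
# Beta(1, 1) prior is proportional to Beta(c*x + 1, c*(n-x) + 1):
# their log densities differ by a constant (the normalizing constant).
import math

def log_power_lik(t, c, x, n):
    return c * x * math.log(t) + c * (n - x) * math.log(1 - t)

def log_beta_pdf(t, a, b):
    lognorm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return lognorm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)

c, x, n = 3, 6, 16
grid = [i / 100 for i in range(1, 100)]
diffs = [log_beta_pdf(t, c * x + 1, c * (n - x) + 1) - log_power_lik(t, c, x, n)
         for t in grid]
assert max(diffs) - min(diffs) < 1e-9   # constant gap => proportionality
print("power likelihood is proportional to the scaled Beta posterior")
```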
