Bayesian Analysis – How to Multiply a Likelihood by a Prior in Python

Tags: bayesian, likelihood, prior, python

I'm trying to wrap my brain around computations in Bayesian stats. The concept of multiplying a prior by a likelihood is a bit confusing to me, especially in a continuous case.

As an example, suppose I believe that heights of men in the US are truly distributed as N(mu=5.5, sigma=1), while my prior belief is that they are distributed as N(mu=5, sigma=1).

Using a pdf function, I can compute the likelihood of seeing each observation given the parameter as follows (Python code):

First, samples from the true distribution.

import numpy as np
## loc = mu, scale = sigma, sample size = size
heights = np.random.normal(loc=5.5,scale=1,size=1000) 

Then the likelihood

def pdf(data, mean=5, variance=1):
    ## normal density; the default mean=5 corresponds to my prior belief
    den = np.sqrt(2*np.pi*variance)
    num = np.exp(-np.square(data - mean)/(2*variance))
    return num/den

likelihoods = [pdf(obs) for obs in heights]

Let's look at the first 10 heights and their likelihoods:

[(5.426044952743029, 0.36432983407060887),
 (5.7354234636458585, 0.30441530694083374),
 (2.6187512313984795, 0.02342125390815794),
 (4.048376000047023, 0.25366706186458265),
 (5.654522163377861, 0.3220211139284403),
 (5.051880755747615, 0.3984057424429508),
 (6.038515919083698, 0.2326555628191281),
 (6.220977020106613, 0.1893172736081514),
 (4.557736652986651, 0.3617734950544695),
 (5.601408005492896, 0.33294288249916787)]

Now… I'm confused about how to multiply this finite set of likelihoods by a continuous prior distribution. What is actually happening?

Best Answer

Perhaps the multiplication of 'prior' by 'likelihood' to obtain 'posterior' will be clearer if we make a careful comparison of (a) a familiar elementary application of Bayes' Theorem for a finite partition with (b) the use of a continuous version of Bayes' Theorem for inference on a parameter.

Bayes' Theorem with a finite partition. Let's begin with a Bayesian problem based on a finite partition. Your factory makes widgets and has $K$ machines: $A_1, A_2, \dots, A_K.$ Every widget is made by exactly one of these machines, so the $K$ machines can be viewed as a finite partition.

(a) The machines run at various speeds. The $j$th machine makes the (prior) proportion $P(A_j)$ of widgets, $j = 1,2,\dots K,$ where $\sum_j P(A_j)=1.$

(b) Machines are of varying quality. The likelihood of a defective widget from machine $A_j$ is $P(D|A_j).$

(c) If we observe that a widget randomly chosen from the warehouse is defective, then the (posterior) probability that widget was made by machine $A_j$ is $$P(A_j | D) = P(A_jD)/P(D) = P(A_j)P(D|A_j)/C$$ where $C = P(D) = \sum_i P(A_iD) = \sum_i P(A_i)P(D|A_i).$

We can say that the expression on the right in the displayed equation is the product of the prior probabilities and likelihood, divided by a constant. Here the likelihood is based on data, the observation that the widget from the warehouse is defective. Thus, suppressing the constant, we could say that the posterior distribution is proportional to the product of the prior distribution and the likelihood, and write $P(A_i|D) \propto P(A_i) \times P(D|A_i).$

However, in discrete Bayesian applications, it is unusual to suppress the constant---because it is an easily computed sum and because it is needed to get numerical results.
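
To make the arithmetic concrete, here is a small Python sketch of the discrete computation. The three machines, their prior proportions, and their defect rates are hypothetical numbers made up purely for illustration; they do not come from the text above.

import numpy as np

## hypothetical prior proportions P(A_j) and defect rates P(D | A_j) for K = 3 machines
prior = np.array([0.5, 0.3, 0.2])          # P(A_1), P(A_2), P(A_3)
likelihood = np.array([0.01, 0.02, 0.05])  # P(D | A_1), P(D | A_2), P(D | A_3)

joint = prior * likelihood                 # P(A_j) P(D | A_j)
C = joint.sum()                            # P(D), the normalizing constant
posterior = joint / C                      # P(A_j | D)

print(C)          # 0.021
print(posterior)  # approx [0.238, 0.286, 0.476]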

Continuous Bayesian situation. Suppose you want to get an interval estimate of a binomial Success probability $\theta,$ where $0 < \theta < 1.$

(a) You have a prior distribution on $\theta,$ which is viewed as a random variable. Say that the density function $$f(\theta) = \frac{\Gamma(330+270)}{\Gamma(330)\Gamma(270)}\theta^{330-1}(1-\theta)^{270-1},$$ for $0 < \theta < 1,$ is that of $\mathsf{Beta}(330, 270).$ We use a beta prior distribution because it has support $(0,1)$ and we choose this particular beta distribution because it puts 95% of its probability in the interval $(0.51, 0.59),$ which matches our prior opinion that $\theta$ is slightly above $1/2.$ (Other similar beta distributions might have been chosen, but this one seems about right.) In R:

diff(pbeta(c(.51,.59),330,270))
[1] 0.9513758
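
(If you prefer to stay in Python, as in the question, the same check can be done with scipy.stats; this sketch assumes SciPy is installed.)

from scipy import stats

## prior probability that theta lies in (0.51, 0.59) under Beta(330, 270)
prior = stats.beta(330, 270)
print(prior.cdf(0.59) - prior.cdf(0.51))   # approx 0.951, matching the R output above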

(b) Then we do an experiment (perhaps take a poll or test for prevalence of a disease), in which we observe $x = 620$ 'Successes' within $n = 1000$ trials. So the binomial likelihood function is based on a binomial PMF viewed as a function of $\theta,$ denoted $$g(x|\theta) = {1000 \choose 620}\theta^{620}(1-\theta)^{1000-620}.$$

(c) The 'continuous' version of Bayes' Theorem can be stated as follows: $$h(\theta|x) = \frac{f(\theta)g(x|\theta)}{\int f(\theta)g(x|\theta)\, d\theta} = \frac{f(\theta)g(x|\theta)}{C} \propto f(\theta) \times g(x|\theta).$$

This is often summarized as $\mathrm{POSTERIOR}\propto \mathrm{PRIOR}\times\mathrm{LIKELIHOOD}.$ (The symbol $\propto$ is read as "proportional to".)

In the current particular application, we can avoid evaluating the integral $C$ because the beta prior distribution is 'conjugate to' (mathematically compatible with) the binomial likelihood. This makes it possible to recognize the right-hand side of the last displayed equation as $$h(\theta|x) \propto f(\theta)g(x|\theta) \propto \theta^{330+620-1}(1-\theta)^{270+(1000-620)-1} = \theta^{950-1}(1-\theta)^{650-1},$$ which is proportional to the density function of $\mathsf{Beta}(950,650).$ Of course, the integral can be evaluated by analytic or computational means, but it is convenient when we don't need to evaluate the constant $C.$
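
Since the question was asked in Python, here is a sketch of the 'multiply a finite set of likelihood values by a continuous prior' operation done numerically on a grid of $\theta$ values (assuming scipy.stats is available). Multiplying the prior density by the binomial likelihood point by point and then normalizing reproduces the conjugate result $\mathsf{Beta}(950, 650)$:

import numpy as np
from scipy import stats

## grid approximation of POSTERIOR ∝ PRIOR × LIKELIHOOD
theta = np.linspace(0.001, 0.999, 1999)          # grid over the parameter space (0, 1)
prior = stats.beta.pdf(theta, 330, 270)          # f(theta)
likelihood = stats.binom.pmf(620, 1000, theta)   # g(x | theta), viewed as a function of theta

unnormalized = prior * likelihood                # f(theta) g(x | theta)
dx = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dx)   # divide by C (numerical integral)

## the grid posterior agrees with the conjugate answer Beta(950, 650)
exact = stats.beta.pdf(theta, 950, 650)
print(np.max(np.abs(posterior - exact)))         # small discrepancy due to the grid approximation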

Finally, we can say that a 95% Bayesian posterior probability interval (also called 'credible interval') is $(0.570, 0.618).$ Specific endpoints of this interval are influenced both by the prior distribution and (somewhat more strongly) by the data from our experiment.

qbeta(c(.025,.975), 950,650)
[1] 0.5695848 0.6176932
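
The Python analogue, if you want to keep everything in one language (again assuming scipy.stats):

from scipy import stats

## 95% posterior credible interval for Beta(950, 650)
print(stats.beta.ppf([0.025, 0.975], 950, 650))   # approx [0.5696, 0.6177]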

If we had used the 'non-informative' Jeffreys' prior $\mathsf{Beta}(.5,.5),$ then the 95% posterior interval estimate from our experiment would have been $(0.590, 0.650).$

qbeta(c(.025,.975), 620.5, 380.5)
[1] 0.5896044 0.6497021