Solved – Bayesian posterior: is multiplying likelihood by prior (rather than simulation) an acceptable approach

bayesian

Ken Rice has a helpful introductory set of slides available online called 'Bayesian Statistics (a very brief introduction)'.

http://faculty.washington.edu/kenrice/BayesIntroClassEpi515kmr2016.pdf

On slide 23 he gives this formulation, which comes directly from Bayes theorem:

Posterior ∝ Likelihood × Prior

However, within a section on 'when priors don't matter (much)', on slide 33 he presents a method whereby you multiply the likelihood function by the prior to get the posterior, but he labels this "semi-Bayesian". (On slide 35, I think he is referring to the same thing when he mentions an "approximate Bayes" approach and describes "full Bayes" as better.)

My question is: in what sense is taking a prior expressed as a functional form and multiplying it by a likelihood function only semi-Bayesian?

Is it just that the (normal) likelihood he presents is only an approximation to the real likelihood function? Or is it because the multiplication he presents is only approximate? Or is there something more fundamentally 'semi' about this type of Bayesianism?

More generally, the focus of texts on Bayesian inference seems to be on simulation (especially MCMC) approaches. Is this because it is 'wrong' to get your posterior distribution from multiplying a prior distribution by the likelihood function generated by some new data? Or is it because the analytical route is not often available to you?

Best Answer

Bayes theorem is

$$ \mathrm{posterior} \propto \mathrm{likelihood} \times \mathrm{prior} $$

so the posterior is proportional to the likelihood times the prior. For this to be an equality we need to multiply the right-hand side by a normalizing constant, so that it integrates to unity, which makes the posterior a proper probability distribution.
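Written out with the normalizing constant made explicit, for data $y$ and parameter $\theta$ the theorem reads

$$ p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, \mathrm{d}\theta'} $$

where the denominator (the marginal likelihood) is exactly the constant that makes the posterior integrate to one.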

The constant does not change where the maximum of the function lies, since every value of the function is multiplied by the same constant, so if you are only interested in a point estimate (the maximum a posteriori estimate) you can ignore the normalizing constant. However, if you want the full posterior distribution, the normalization is needed, and since the required integral is usually intractable we often use MCMC to sample from the posterior rather than computing it analytically.
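To make this concrete, here is a minimal sketch in Python (the data and the one-dimensional parameter are made-up assumptions for illustration) that obtains the posterior directly by multiplying the likelihood by the prior on a grid and normalizing numerically; note that the MAP estimate is the same whether or not you normalize:

```python
import numpy as np
from scipy import stats

# Hypothetical data: observations assumed N(theta, 1) with unknown mean theta
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])

# A grid over theta is feasible here because the parameter is one-dimensional
theta = np.linspace(-3.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

prior = stats.norm.pdf(theta, loc=0.0, scale=2.0)  # N(0, 2^2) prior on theta

# Likelihood of the whole sample: product of the densities of each observation given theta
likelihood = np.prod(stats.norm.pdf(y[:, None], loc=theta, scale=1.0), axis=0)

unnormalized = likelihood * prior                          # posterior up to a constant
posterior = unnormalized / (unnormalized.sum() * dtheta)   # normalize to integrate to 1

# The argmax (MAP estimate) is unaffected by the normalizing constant
map_estimate = theta[np.argmax(unnormalized)]
```

In one or two dimensions this 'multiply and normalize' route is perfectly workable; MCMC becomes attractive when the parameter space is high-dimensional and the grid (and the normalizing integral) are no longer feasible.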

See also the Why Normalizing Factor is Required in Bayes Theorem? thread.

Edit

But, as noticed by Xi'an, what the slides you refer to actually say is that by a "semi-Bayesian" approach the author means using a normal distribution as the likelihood function together with a normal prior:

[Screenshot of the relevant slides]

This makes the computation very easy, since the normal prior is conjugate to the normal likelihood and the posterior is available in closed form, but it may not be the best approximation in all cases (recall that the normal distribution is continuous, symmetric, and has support from $-\infty$ to $\infty$ -- this is not true for many kinds of data!).
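For reference, this is the standard conjugate normal-normal update that makes the computation so easy: with a $N(\mu_0, \tau_0^2)$ prior on the mean $\theta$, and $n$ observations $y_i \sim N(\theta, \sigma^2)$ with $\sigma^2$ known, the posterior is again normal,

$$ \theta \mid y \sim N\!\left( \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2},\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1} \right) $$

so the posterior mean is a precision-weighted average of the prior mean and the sample mean, and as $n$ grows the prior's contribution fades, which is exactly the "priors don't matter (much)" point in the slides.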