Bayesian Analysis – Why Is the Bayes Factor Sometimes More Important Than the Posterior Odds?

bayesian, odds-ratio

To the best of my knowledge, the posterior odds satisfies the equation: $$(\text{posterior odds}) = (\text{Bayes factor}) \times (\text{prior odds}) $$ This is a simple consequence of Bayes' rule.
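For concreteness, with made-up numbers: a Bayes factor of $3$ combined with prior odds of $1/2$ gives posterior odds of $3 \times \tfrac{1}{2} = \tfrac{3}{2}$, i.e. the data shift the prior odds by exactly the factor coming from the likelihoods.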

The whole point of Bayesian inference applied to model selection, or so I thought, is to use the information in the prior probabilities to obtain a more accurate estimate of the correct answer than the naive estimate one gets from the likelihoods alone, which is given by the Bayes factor.

However, I recall having read several papers where the Bayes factors were reported as evidence of one model being more likely than the other.

Was the idea of the papers' authors to appeal to frequentists, who would have considered it taboo to incorporate information from prior probabilities, and to show that their results were robust to such objections on methodological/philosophical grounds?

Would a Bayesian ever be more interested in the Bayes factor than the posterior odds?

Note: I had these questions while reading the first chapter of James Stone's "Bayes' Rule: A Tutorial Introduction to Bayesian Analysis" and while thinking back to some papers I had read a while ago about influenza virus transmission. I can try to find the paper if that would help.

Anyway, I am a complete novice at this, so I apologize in advance if this question is nonsensical.

Best Answer

The Bayes factor is defined on hypotheses, not parameter values.

For hypotheses $H_1$ and $H_2$, with observed data $Y$, we define the Bayes factor as $\frac{P\left( Y\ |\ H_1 \right)}{P\left( Y\ |\ H_2 \right)}$. When $H_1$ and $H_2$ are point hypotheses, it is in fact just a likelihood ratio, and when the hypotheses are nested this is the same likelihood ratio used in the standard likelihood-ratio tests.
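As a minimal sketch of the point-hypothesis case (the data, $14$ successes in $20$ trials, and the two parameter values are made up for illustration), the Bayes factor reduces to a plain ratio of likelihoods:

```python
# Hypothetical example: Bayes factor for two *point* hypotheses about a
# binomial success probability theta. For point hypotheses the Bayes
# factor is simply the ratio of the two likelihoods.
from scipy.stats import binom

y, n = 14, 20                  # made-up data: 14 successes in 20 trials
theta_1, theta_2 = 0.7, 0.5    # H1: theta = 0.7, H2: theta = 0.5

bf_12 = binom.pmf(y, n, theta_1) / binom.pmf(y, n, theta_2)
print(f"BF_12 = {bf_12:.2f}")  # values > 1 favour H1
```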

But we generally aren't interested in Bayes factors of point hypotheses in nested models. We want to compare model specifications wholesale, which is something we cannot do with likelihoods alone. The Bayes factor makes this possible because an "un-fitted" statistical model $H$ is effectively a compound hypothesis over the entire parameter space $\Theta$ of that model. By this logic, we can treat any model as an hypothesis and use the law of total probability to obtain $$ P\left( Y\ |\ H \right) = \int_\Theta P\left( Y\ |\ \theta, H \right) P\left( \theta\ |\ H \right)\ \mathrm{d}\theta, $$ which is clearly not the same thing as the maximum likelihood $\max_{\theta \in \Theta} P\left( Y\ |\ \theta, H \right)$. It should be obvious from this definition that the Bayes factor does depend on one's choice of priors, and heavily so. In fact, Bayes factors can be used to compare the plausibility of different priors for otherwise identical models (philosophical concerns notwithstanding).
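Here is a small sketch of that distinction (the data are the same made-up $14/20$ as above, and the two Beta priors are assumptions chosen purely for illustration): it computes $P(Y\ |\ H)$ by numerically integrating the binomial likelihood against the prior, and shows that two models differing only in their prior yield different marginal likelihoods, hence a non-trivial Bayes factor.

```python
# Sketch: marginal likelihood of a *model* (theta ~ Beta(a, b) plus a
# binomial likelihood), obtained by integrating the likelihood over the
# prior, versus the maximum likelihood.
import numpy as np
from scipy.stats import binom, beta
from scipy.integrate import quad

y, n = 14, 20   # same made-up data as above

def marginal_likelihood(a, b):
    """P(Y | H) = integral of P(Y | theta, H) * P(theta | H) d theta."""
    integrand = lambda t: binom.pmf(y, n, t) * beta.pdf(t, a, b)
    value, _ = quad(integrand, 0.0, 1.0)
    return value

# Two otherwise-identical models that differ only in their prior on theta
m_flat  = marginal_likelihood(1, 1)     # uniform prior on theta
m_tight = marginal_likelihood(50, 50)   # prior concentrated near theta = 0.5

print("maximum likelihood    :", binom.pmf(y, n, y / n))
print("marginal, flat prior  :", m_flat)
print("marginal, tight prior :", m_tight)
print("Bayes factor (flat vs tight prior):", m_flat / m_tight)
```

Neither marginal likelihood equals the maximum likelihood, and the two priors give different Bayes factors even though the likelihood function is identical, which is exactly the prior sensitivity described above.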

You have in mind something like "the posterior odds of $\theta$ and $\theta'$", and therefore you are confused as to how a Bayes factor is any different from a likelihood ratio. What you need to consider is not the posterior odds of two specific parameter values $\theta$ and $\theta'$ but the posterior odds of entire models. Stats 101 implicitly trains us to think of hypotheses as numerical values. To understand Bayes factors, it is better to think of an hypothesis as a pair like $\left(M, \Theta\right)$, where $\Theta$ is a parameter space and $M$ is a representation of the model specification.
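At the level of entire models, the odds formula looks like this (the marginal likelihoods and prior model probabilities below are hypothetical placeholders, e.g. numbers of the kind produced by the integral sketched earlier):

```python
# Posterior odds of two whole models M1 and M2, not of two parameter
# values: combine the model-level Bayes factor with the prior odds
# assigned to the models themselves.
p_y_m1, p_y_m2 = 0.048, 0.040      # assumed marginal likelihoods P(Y|M1), P(Y|M2)
prior_m1, prior_m2 = 0.5, 0.5      # assume equal prior model probabilities

bayes_factor = p_y_m1 / p_y_m2
prior_odds = prior_m1 / prior_m2
posterior_odds = bayes_factor * prior_odds
print(f"BF_12 = {bayes_factor:.2f}, posterior odds = {posterior_odds:.2f}")
```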

You noted that the Bayes factor appears in the relation $(\text{posterior odds}) = (\text{Bayes factor}) \times (\text{prior odds})$. This isn't wrong, but your question evinces the danger of relying too heavily on a simple interpretation of a rich concept.

There are actually several very nice writeups and explanations of Bayes factors out there on the Internet that go into more depth than this answer.
