Likelihood Ratio vs Bayes Factor – Understanding Key Differences

bayes-factors, likelihood-ratio

I'm rather evangelistic with regard to the use of likelihood ratios for representing the objective evidence for/against a given phenomenon. However, I recently learned that the Bayes factor serves a similar function in the context of Bayesian methods (i.e. the subjective prior is combined with the objective Bayes factor to yield an objectively updated subjective state of belief). I'm now trying to understand the computational and philosophical differences between a likelihood ratio and a Bayes factor.

At the computational level, I understand that while the likelihood ratio is usually computed using the likelihoods that represent the maximum likelihood for each model's respective parameterization (either estimated by cross-validation or penalized according to model complexity using AIC), the Bayes factor apparently uses likelihoods that represent the likelihood of each model integrated over its entire parameter space (i.e. not just at the MLE). How is this integration typically achieved? Does one really just calculate the likelihood at each of thousands (millions?) of random samples from the parameter space, or are there analytic methods for integrating the likelihood across the parameter space? Additionally, when computing the Bayes factor, does one apply a correction for complexity (automatically via cross-validated estimation of the likelihood, or analytically via AIC) as one does with the likelihood ratio?

Also, what are the philosophical differences between the likelihood ratio and the Bayes factor? (N.b. I'm not asking about the philosophical differences between the likelihood ratio and Bayesian methods in general, but about the Bayes factor as a representation of the objective evidence specifically.) How would one go about characterizing the meaning of the Bayes factor as compared to the likelihood ratio?

Best Answer

the Bayes factor apparently uses likelihoods that represent the likelihood of each model integrated over its entire parameter space (i.e. not just at the MLE). How is this integration typically achieved? Does one really just calculate the likelihood at each of thousands (millions?) of random samples from the parameter space, or are there analytic methods for integrating the likelihood across the parameter space?

First, any situation where you work with a term such as $P(D|M)$ for data $D$ and model $M$ involves a likelihood model. This is the bread and butter of any statistical analysis, frequentist or Bayesian, and it is the part your analysis is meant to show is either a good fit or a bad fit. So Bayes factors are not doing anything fundamentally different from likelihood ratios.

It's important to put Bayes factors in their proper setting. When you have two models and you convert from probabilities to odds, the Bayes factor acts like an operator on prior beliefs:

$$ \text{Posterior Odds} = \text{Bayes Factor} \times \text{Prior Odds} $$
$$ \frac{P(M_{1}|D)}{P(M_{2}|D)} = B_{1,2} \times \frac{P(M_{1})}{P(M_{2})} $$
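As a quick numerical illustration (numbers invented for the example): a Bayes factor of $5$ in favor of $M_{1}$, applied to prior odds of $1:4$ against it, gives
$$ \frac{P(M_{1}|D)}{P(M_{2}|D)} = 5 \times \frac{0.2}{0.8} = 1.25, $$
i.e. posterior odds of $5:4$ in favor of $M_{1}$. The data can move you across the even-odds line, but only by as much as the Bayes factor allows.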

The real difference is that likelihood ratios are cheaper to compute and generally easier to specify conceptually. The likelihood at the MLE is just a point estimate of the Bayes factor's numerator and denominator, respectively. Like most frequentist constructions, the likelihood ratio can be viewed as a special case of a Bayesian analysis with a contrived prior that's hard to get at. But mostly it arose because it is analytically tractable and easy to compute (in the era before approximate Bayesian computational approaches arose).
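To make that contrast concrete (writing $\theta_{i}$ for the parameters of $M_{i}$, $\pi(\theta_{i}|M_{i})$ for the prior, and $\hat{\theta}_{i}$ for the MLE):
$$ \text{LR} = \frac{p(D|\hat{\theta}_{1},M_{1})}{p(D|\hat{\theta}_{2},M_{2})}, \qquad B_{1,2} = \frac{\int p(D|\theta_{1},M_{1})\,\pi(\theta_{1}|M_{1})\,d\theta_{1}}{\int p(D|\theta_{2},M_{2})\,\pi(\theta_{2}|M_{2})\,d\theta_{2}}. $$
The numerator and denominator of the Bayes factor are the marginal (integrated) likelihoods $P(D|M_{i})$, which is exactly the integration over the whole parameter space that you asked about.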

To the point on computation: yes, in almost any case of practical interest you will evaluate the likelihood integrals in the Bayesian setting with a large-scale Monte Carlo procedure. There are some specialized simulators, such as GHK, that work if you assume certain distributions, and under those assumptions you can sometimes find analytically tractable problems for which fully analytic Bayes factors exist.

But hardly anyone uses these; there is little reason to. With optimized Metropolis/Gibbs samplers and other MCMC methods, it is entirely tractable to approach these problems in a fully data-driven way and compute the integrals numerically. In fact, one will often do this hierarchically, further integrating the results over meta-priors that relate to data-collection mechanisms, non-ignorable experimental designs, and so on.
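As a minimal sketch of the brute-force version of this (plain Monte Carlo over prior draws rather than MCMC, on a toy coin-flip problem where the integral also has a closed form we can check against; the models and numbers are invented for illustration):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Toy data: k heads in n flips (made-up numbers).
n, k = 100, 62

# M1: theta fixed at 0.5 -- no free parameters, so the marginal
# likelihood is just the likelihood itself.
marg_m1 = binom.pmf(k, n, 0.5)

# M2: theta ~ Uniform(0, 1). Monte Carlo estimate of the integrated
# likelihood: average the likelihood over draws from the prior.
theta = rng.uniform(0.0, 1.0, size=100_000)
marg_m2 = binom.pmf(k, n, theta).mean()

# Closed-form check: a binomial likelihood integrated against a
# uniform prior is exactly 1 / (n + 1).
print(f"P(D|M2): MC {marg_m2:.4e} vs exact {1 / (n + 1):.4e}")
print(f"Bayes factor B_21 = {marg_m2 / marg_m1:.2f}")
```

In any realistically sized parameter space this naive prior-sampling estimator has hopeless variance, which is why the MCMC-based machinery mentioned above (and estimators built on top of it, such as bridge sampling) is what actually gets used.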

I recommend the book Bayesian Data Analysis for more on this, although its lead author, Andrew Gelman, seems not to care much for Bayes factors. As an aside, I agree with Gelman: if you're going to go Bayesian, exploit the full posterior. Doing model selection with Bayesian methods is like handicapping them, because model selection is a weak and mostly useless form of inference. I'd rather know distributions over model choices if I can; who cares about quantizing things down to "model A is better than model B" statements when you don't have to?

Additionally, when computing the Bayes factor, does one apply a correction for complexity (automatically via cross-validated estimation of the likelihood, or analytically via AIC) as one does with the likelihood ratio?

This is one of the nice things about Bayesian methods: Bayes factors automatically account for model complexity, in a technical sense. You can set up a simple scenario with two models, $M_{1}$ and $M_{2}$, with assumed model complexities (parameter counts) $d_{1}$ and $d_{2}$ respectively, where $d_{1} < d_{2}$, and a sample size $N$.

Then if $B_{1,2}$ is the Bayes factor with $M_{1}$ in the numerator, one can prove that, under the assumption that $M_{1}$ is true, $B_{1,2} \to \infty$ as $N \to \infty$ at a rate that depends on the difference in model complexity: the Bayes factor automatically favors the simpler model. More specifically, under all of the above assumptions, $$ B_{1,2} = \mathcal{O}\!\left(N^{\frac{1}{2}(d_{2}-d_{1})}\right) $$
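A quick heuristic for where that rate comes from (a sketch via the standard Laplace/BIC approximation, not Frühwirth-Schnatter's full derivation): under the usual regularity conditions each marginal likelihood satisfies
$$ \log P(D|M_{i}) = \log p(D|\hat{\theta}_{i},M_{i}) - \frac{d_{i}}{2}\log N + \mathcal{O}(1), $$
so
$$ \log B_{1,2} = \left[ \log p(D|\hat{\theta}_{1},M_{1}) - \log p(D|\hat{\theta}_{2},M_{2}) \right] + \frac{d_{2}-d_{1}}{2}\log N + \mathcal{O}(1). $$
When $M_{1}$ is true (and nested in $M_{2}$), the bracketed term stays bounded in probability because the extra parameters of $M_{2}$ only fit noise, so the $\frac{1}{2}(d_{2}-d_{1})\log N$ penalty dominates and $B_{1,2}$ grows like $N^{\frac{1}{2}(d_{2}-d_{1})}$.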

I know this derivation and the surrounding discussion from the book Finite Mixture and Markov Switching Models by Sylvia Frühwirth-Schnatter, but there are likely more directly statistical accounts that dig deeper into the epistemology underlying it.

I don't know the details well enough to give them here, but I believe there are some fairly deep theoretical connections between this and the derivation of AIC. Cover and Thomas's Elements of Information Theory hints at this, at least.
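For reference, the two standard penalized-likelihood criteria are
$$ \mathrm{AIC}_{i} = -2\log p(D|\hat{\theta}_{i},M_{i}) + 2d_{i}, \qquad \mathrm{BIC}_{i} = -2\log p(D|\hat{\theta}_{i},M_{i}) + d_{i}\log N, $$
and $-\tfrac{1}{2}\mathrm{BIC}_{i}$ is precisely the approximation to $\log P(D|M_{i})$ sketched above, up to $\mathcal{O}(1)$. AIC's penalty comes from a different argument (expected out-of-sample Kullback-Leibler loss), so the two criteria answer related but distinct questions; still, this is one concrete bridge between integrated likelihoods and penalized maximum likelihood.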

Also, what are the philosophical differences between the likelihood ratio and the Bayes factor? (N.b. I'm not asking about the philosophical differences between the likelihood ratio and Bayesian methods in general, but about the Bayes factor as a representation of the objective evidence specifically.) How would one go about characterizing the meaning of the Bayes factor as compared to the likelihood ratio?

The "Interpretation" section of the Wikipedia article on Bayes factors does a good job of discussing this (especially the table showing Jeffreys' strength-of-evidence scale).

As usual, there isn't much philosophy here beyond the basic differences between Bayesian and frequentist methods (which you already seem familiar with).

The main thing is that the likelihood ratio is not coherent in a Dutch-book sense: you can concoct scenarios where model selection based on likelihood ratios will lead one to accept losing bets. The Bayesian method is coherent, but it operates on a prior, which could be extremely poor and has to be chosen subjectively. Tradeoffs, tradeoffs...

FWIW, I think this kind of heavily parameterized model selection is not very good inference. I prefer Bayesian methods, I prefer to organize them hierarchically, and I want the inference to center on the full posterior distribution whenever that is computationally feasible. Bayes factors have some neat mathematical properties, but as a Bayesian myself I am not impressed by them. They conceal the really useful part of Bayesian analysis, which is that it forces you to deal with your priors out in the open instead of sweeping them under the rug, and that it lets you do inference on full posteriors.
