Naive Bayes – When is a Naive Bayes Model Not Bayesian?

bayesian, naive-bayes

I've been using Bayesian inference for a while, and as far as I could tell, Naive Bayes was "Bayesian" since it has a prior and a posterior and follows Bayes' rule.

I just read a section on "Bayesian Naive Bayes" in the book Bayesian Reasoning and Machine Learning, and I'm having trouble understanding how we can have a 'non-Bayesian' Naive Bayes. The book says that the usual Naive Bayes is not Bayesian since it uses the maximum likelihood estimate to set p(c) (where c is the class), whereas in the Bayesian version we can use, say, a Dirichlet distribution for the prior.
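In symbols, this is how I read the distinction (my own notation, not necessarily the book's): the usual version plugs in a maximum likelihood point estimate for the class probabilities, while the Bayesian version keeps a distribution over them,

$$\hat p(c) = \frac{N_c}{N} \qquad \text{vs.} \qquad p(\boldsymbol\theta) = \operatorname{Dir}(\boldsymbol\theta \mid \boldsymbol\alpha), \quad p(\boldsymbol\theta \mid \mathcal{D}) = \operatorname{Dir}(\boldsymbol\theta \mid \boldsymbol\alpha + \mathbf{N}),$$

where $\theta_c = p(c)$, $N_c$ is the number of training examples in class $c$, $N = \sum_c N_c$, and $\mathbf{N} = (N_1, \dots, N_C)$.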

My question is about the distinction between the two. Aren't both of these priors? I've searched for a while, but that's the only explanation I've managed to find.

Best Answer

Informally, to be 'Bayesian' about a model (Naive Bayes just names a class of discrete mixture models) is to use Bayes theorem to infer the values of its parameters or other quantities of interest. To be 'Frequentist' about the same model is, roughly speaking (and among other things), to use the sampling distributions of estimators of those quantities to infer what their values might be.

Turning to your Naive Bayes / mixture model: for exposition, let's assume that all the component parameters and functional forms are known and that there are two components (classes, whatever you like to call them).

What is described as the 'prior' in a mixture model is a mixing parameter in the early stages of a hierarchically structured generative model. If you estimate this mixing parameter in the usual (ML, i.e. Frequentist) way, via an EM algorithm, then you have taken a convenient route up the model likelihood to find a maximum, and used that as a point estimate of the true value of the mixing parameter. Maybe you use the curvature of the likelihood at that point to give yourself a measure of uncertainty. (But probably not). Typically you'd then use it to get membership probabilities for individual observations by assuming that value and applying Bayes theorem.
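For concreteness, here is a minimal sketch of that route in Python. It uses the simplifying assumptions above (two components with known Gaussian parameters, so the only unknown is the mixing weight); the numbers and variable names are just illustrative, not anything from the book.

```python
# Minimal sketch of the ML/EM route: two components with *known* Gaussian
# parameters, so the only unknown is the mixing weight. Illustrative only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy data from a two-component mixture with true mixing weight 0.3
z = rng.random(500) < 0.3
x = np.where(z, rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500))

# Known component densities (the exposition assumes these are given)
f1 = norm(-2.0, 1.0).pdf
f2 = norm(2.0, 1.0).pdf

# EM: climb the likelihood to a single point estimate of the mixing weight
pi_hat = 0.5
for _ in range(200):
    # E-step: membership probabilities given the current value (Bayes theorem)
    r = pi_hat * f1(x) / (pi_hat * f1(x) + (1 - pi_hat) * f2(x))
    # M-step: set the mixing weight to the average responsibility
    pi_hat = r.mean()

# Plug the single ML value back in to get per-observation memberships
memberships = pi_hat * f1(x) / (pi_hat * f1(x) + (1 - pi_hat) * f2(x))
print(f"ML point estimate of the mixing weight: {pi_hat:.3f}")
```

Note that pi_hat ends up as a single number; the only 'posterior' anywhere is the per-observation membership, computed by treating that number as known.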

This seems Bayesian because it uses Bayes theorem. However, it is un-Bayesian in two ways. First, you used the same data to determine the 'prior' (the mixing parameter) and some relevant 'posteriors' (membership probabilities for individual observations), so the 'prior' isn't really prior: it's already conditioned on the data. Second, and more generally (the first point is an instance of this), Bayes theorem is being used to infer some unknowns (membership probabilities) but not others (the mixing coefficient).

That's why, if you decide to do this in a Bayesian fashion, then since you don't know the value of the mixing parameter in advance, you give it some prior distribution. Maybe that's a Dirichlet (hence a Beta in this stripped-down two-component exposition) with some parameters or other, set to reflect your uncertainty. Then you figure out how to condition on the data to get a posterior distribution over it and over all the other stuff you care about but don't know, such as the component memberships of each observation. To infer any subset of these, marginalize out the rest.
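Continuing the same toy setup, here is a minimal sketch of that Bayesian route. It puts a Beta(2, 2) prior on the mixing weight (the parameters are arbitrary) and uses a simple grid approximation to the posterior, which is just one convenient way to do the conditioning; memberships are then obtained by marginalizing the mixing weight out rather than plugging in a point estimate.

```python
# Minimal sketch of the Bayesian route: a Beta prior (the two-component case
# of the Dirichlet) on the mixing weight, a grid approximation to its
# posterior, and memberships obtained by marginalizing the weight out.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import beta, norm

rng = np.random.default_rng(0)
z = rng.random(500) < 0.3
x = np.where(z, rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500))

f1 = norm(-2.0, 1.0).pdf
f2 = norm(2.0, 1.0).pdf

# Prior over the mixing weight, evaluated on a grid
pi_grid = np.linspace(0.001, 0.999, 999)
log_prior = beta(2, 2).logpdf(pi_grid)

# Marginal log-likelihood of the data at each grid value of the mixing weight
log_lik = np.array([np.log(p * f1(x) + (1 - p) * f2(x)).sum() for p in pi_grid])

# Posterior over the mixing weight: a distribution, not a point estimate
log_post = log_prior + log_lik
post = np.exp(log_post - logsumexp(log_post))
print(f"Posterior mean of the mixing weight: {np.sum(pi_grid * post):.3f}")

# Membership probabilities, with the mixing weight marginalized out
r_given_pi = pi_grid * f1(x)[:, None] / (
    pi_grid * f1(x)[:, None] + (1 - pi_grid) * f2(x)[:, None]
)
memberships = r_given_pi @ post
```

Here nothing about the data leaks into the prior: the Beta(2, 2) is fixed before seeing x, and both the mixing weight and the memberships are inferred with Bayes theorem, which is the contrast being drawn above.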

In Frequentist terms, there are known and unknown parts of the model, but no uncertain parts, so nothing needs a prior: you either know them (e.g. that the components are Gaussian) or you don't (e.g. the means of each component). Even when there are distributions involved in generating the data, as there are in the mixture model, none of them is a Bayesian prior, regardless of whether you use Bayes theorem on them. Rather, they represent actual or hypothetical randomizing mechanisms of some sort. Specifically, the mixture model provides a hypothetical randomization scheme for generating data: toss a coin weighted according to the value of the mixing parameter to decide on a component, then draw from that component's distribution to generate an observation. This whole process has parameters, and you have to estimate them from the data.
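Written out as a short sketch, that hypothetical randomization scheme is just this (the weight and the components are again made up for illustration):

```python
# The hypothetical randomization scheme: a weighted coin picks the component,
# then that component's distribution generates the observation.
import numpy as np

def generate(n, mixing_weight, seed=0):
    rng = np.random.default_rng(seed)
    component = rng.random(n) < mixing_weight  # coin toss, weighted by the mixing parameter
    return np.where(component,
                    rng.normal(-2.0, 1.0, n),  # draw from component 1
                    rng.normal(2.0, 1.0, n))   # draw from component 2

sample = generate(1000, mixing_weight=0.3)
```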

So what looks like 'posterior inference', with a 'prior', is actually regular inference where the data generating process has some distributional machinery in the middle.

This is rather like the Frequentist take on mixed models, and unlike Frequentist inference for, say, a regression coefficient, where there is no such intermediate structure to make anybody think of priors or posteriors.

It might be worth noting that Fisher, the arch anti-Bayesian, was happy to use Bayes theorem when he thought there was a real randomization mechanism embedded in the data generation process, e.g. in theoretical biology problems involving gene frequencies. This is a consistent position. Just not a Bayesian one.

Hope that helps.