Naive Bayes Likelihood – Understanding and Calculation

Tags: bic, maximum-likelihood, naive-bayes

I'm interested in computing the Bayesian Information Criterion (BIC) for a set of Naive Bayes models.

The naive Bayes (NB) model can be described as follows: for a two-class target $Y \in \{0,1\}$ with predictors $\mathbf{X} = (X_1, X_2, \dots, X_k)$, we have the joint probability

$$
P(Y = y, X_1 = x_1, X_2 = x_2, \dots, X_k = x_k) \\
= P(Y = y) \prod_{j = 1}^{k} P(X_j = x_j \mid Y = y)
$$

assuming conditional independence of all attributes given the class. This joint probability is also proportional to the posterior

$$
\theta = P(Y = y \mid \mathbf{X}) \propto P(Y = y) \prod_{j = 1}^{k} P(X_j = x_j \mid Y = y)
$$

I denote the posterior with $\theta$ for short.
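
For concreteness, here is a minimal sketch (my own illustration, assuming binary 0/1 predictors and empirical, count-based probability estimates) of how this posterior would be computed:

```python
import numpy as np

def nb_posterior(X, y, x_new):
    """Naive Bayes posterior P(Y = 1 | x_new), using empirical (count-based)
    estimates of P(Y = c) and P(X_j = x_j | Y = c)."""
    unnormalized = []
    for c in (0, 1):
        prior = np.mean(y == c)                      # P(Y = c)
        Xc = X[y == c]                               # training rows of class c
        cond = np.prod([np.mean(Xc[:, j] == x_new[j])
                        for j in range(X.shape[1])]) # prod_j P(X_j = x_j | Y = c)
        unnormalized.append(prior * cond)
    # the proportionality constant cancels when normalizing over the two classes
    return unnormalized[1] / sum(unnormalized)
```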

The BIC has the general formula:

$$
-2 \ln(\hat{L}) + k \ln(n)
$$

where

$\hat{L}$ = the maximized likelihood (so $-2 \ln(\hat{L})$ is the deviance),

$k$ = the number of parameters to be estimated,

$n$ = the number of observations.
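
In code, the generic formula is just the following (a small sketch of my own, where `loglik` stands for $\ln(\hat{L})$):

```python
import numpy as np

def bic(loglik, k, n):
    """BIC = -2 ln(L-hat) + k ln(n); note that -2 * loglik is the deviance."""
    return -2.0 * loglik + k * np.log(n)
```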

My question is whether, for a two-class classification problem, the likelihood for the naive Bayes model is a Bernoulli density, such as

$$
L(\theta \mid \mathbf{X}) = \prod_{i=1}^{n} \theta_i^{y_i} (1 - \theta_i)^{1-y_i},
$$

where $\theta_i = P(Y = 1 \mid \mathbf{x}_i)$ is the posterior for observation $i$. This would be similar to logistic regression.

Is this assumption correct?
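
In other words, the log-likelihood I have in mind would be computed roughly like this (my own sketch, assuming $\theta_i$ is the fitted posterior for observation $i$):

```python
import numpy as np

def bernoulli_loglik(theta, y):
    """ln L(theta | X) = sum_i [ y_i ln(theta_i) + (1 - y_i) ln(1 - theta_i) ]."""
    theta = np.clip(theta, 1e-12, 1 - 1e-12)   # avoid log(0)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
```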

Also, I am aware that this type of question has been asked before (e.g. "Maximum Likelihood formula for Naive Bayes"), but a clear answer has not yet been given.

Best Answer

You correctly noticed that what both models output is the conditional probability of the target variable given the explanatory variables, $\theta = p(y \mid x_1, x_2, \dots, x_m)$. If the target variable $Y$ is binary, then it follows a Bernoulli distribution, and both models use the same distribution in the likelihood function. Moreover, you would see the same distribution used in the likelihood function of any other binary probabilistic classifier. Similarly, the same loss function can be minimized in different ways: squared loss can be minimized by ordinary least squares, or by a complicated deep neural network with a linear output layer trained with some optimizer. There is not much to derive here; the Bernoulli distribution is simply the distribution for binary data. You can find more details in the paper The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm by Michael Collins.

The difference between the models is that the naive Bayes algorithm is a generative model, while logistic regression is a discriminative model (see the chapter by Jurafsky and Martin, or Ng and Jordan, 2002). They also make different assumptions about the data (naive Bayes assumes conditional independence of the variables given the class, logistic regression assumes a specific linear functional form), and there are a few other differences as described in this Quora thread. So both algorithms approach the same problem in a different way, with a different form of the model, different parameters, and a different way of estimating them. In naive Bayes you estimate the prior and the class-conditional probabilities directly from the data and obtain the posterior indirectly, via Bayes' theorem, while in logistic regression you use a linear predictor, the logistic link function, and the Bernoulli likelihood, which is maximized to estimate the probabilities directly.
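
To illustrate, here is a minimal sketch (not part of the original answer, using scikit-learn's BernoulliNB and LogisticRegression on made-up binary data): both classifiers output the same kind of quantity, $\hat{\theta}_i = \hat{p}(y = 1 \mid x_i)$, and those fitted probabilities can be plugged into the same Bernoulli log-likelihood, even though the two models estimate them in very different ways.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))        # toy binary predictors (illustrative only)
y = rng.integers(0, 2, size=200)             # toy binary target

for model in (BernoulliNB(), LogisticRegression()):
    theta = model.fit(X, y).predict_proba(X)[:, 1]  # fitted P(Y = 1 | x_i)
    theta = np.clip(theta, 1e-12, 1 - 1e-12)        # guard against log(0)
    loglik = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
    print(type(model).__name__, round(loglik, 2))
```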

When calculating BIC for naive Bayes, the number of parameters $k$ in the formula is the number of probabilities to be estimated, so if the model is

$$ p(y, x_1, x_2, \dots, x_m) = p(y) \prod_{j=1}^m p(x_j \mid y) $$

then each $p(y)$ and $p(x_j \mid y)$ is a distinct parameter to be estimated from the data, which is done by calculating the empirical probabilities. When using exactly the same variables, logistic regression and the naive Bayes classifier would have the same $k$, since naive Bayes has the prior $p(y)$ where logistic regression has the intercept $\beta_0$, plus the parameters per variable.
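
As an illustration, here is a sketch of my own for binary 0/1 variables; the parameter count follows the convention described above, and the exact $k$ depends on how you count the free probabilities in each conditional table:

```python
import numpy as np

def nb_bic(X, y):
    """BIC for the naive Bayes joint model p(y, x) = p(y) * prod_j p(x_j | y),
    with probabilities estimated as empirical (relative) frequencies."""
    n, m = X.shape
    prior1 = np.mean(y)                            # p(y = 1)
    loglik = 0.0
    for yi, xi in zip(y, X):
        Xc = X[y == yi]                            # training rows of class y_i
        p = prior1 if yi == 1 else 1.0 - prior1    # p(y_i)
        for j in range(m):
            p *= np.mean(Xc[:, j] == xi[j])        # p(x_ij | y = y_i)
        loglik += np.log(p)
    # one parameter for the prior p(y) plus one conditional per variable,
    # following the counting in the text; a per-class count would give a larger k
    k = 1 + m
    return -2.0 * loglik + k * np.log(n)
```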

Jurafsky, D. and Martin, J. H. (2017). Logistic Regression. In: Speech and Language Processing (online draft, August 7, 2017).

Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In: Advances in Neural Information Processing Systems, pp. 841-848.
