Maximum Likelihood – Naive Bayes Classifier Explained

maximum likelihood, naive bayes

With regard to the Naive Bayes classifier, I read the following on Wikipedia and wanted to know why it is like that:

"In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods."

Therefore, if this is the case, I will be ignoring the probability that goes in the denominator:

$$ P(A \mid B) = P(B \mid A) \, P(A) $$

Can someone explain to me in simple words what this means, please?

Best Answer

A Bayesian model is defined in terms of a likelihood function (the probability of observing the data given the parameters) and priors (assumed distributions for the estimated parameters). The naive Bayes algorithm estimates the probabilities directly from the data, so it does not make any assumptions about their distributions (it does not use priors), and that is why it is not Bayesian.

The algorithm estimates the joint probability of the target variable $y$ and the features $x_1, x_2, \dots, x_m$ and then classifies by choosing the class that is assigned the greatest probability. It is called naive because it makes the "naive" assumption that the features are independent given the class, so we can write

$$ p(x_1, x_2, \dots, x_m \mid y) = \prod_{j=1}^m p(x_j \mid y) $$

and then, using the definition of conditional probability, we can obtain the joint probability

$$ p(y, x_1, x_2, \dots, x_m) = p(x_1, x_2, \dots, x_m \mid y) \; p(y) $$

This may not sound like a big deal, but estimating the per-feature conditional probabilities $p(x_j \mid y)$ is far easier than estimating the joint probability $p(x_1, x_2, \dots, x_m \mid y)$ all at once.
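Putting the two formulas together, the classifier picks the class with the largest joint probability, $\hat y = \arg\max_y \, p(y) \prod_{j=1}^m p(x_j \mid y)$. Here is a minimal Python sketch of that rule, assuming the probability tables have already been estimated; the numbers and the two binary features are invented toy values, only meant to show how the product is formed and maximized.

```python
# A minimal, self-contained sketch (toy numbers, not from the answer):
# classify one example with naive Bayes probability tables that are
# assumed to be already estimated.

# Marginal p(y) for two classes y in {0, 1}
prior = {0: 0.6, 1: 0.4}

# Conditional p(x_j = 1 | y) for two binary features x_1, x_2
likelihood = {
    0: [0.2, 0.7],
    1: [0.8, 0.3],
}

def joint(y, x):
    """p(y, x_1, ..., x_m) = p(y) * prod_j p(x_j | y) under the naive assumption."""
    p = prior[y]
    for j, x_j in enumerate(x):
        p_j1 = likelihood[y][j]
        p *= p_j1 if x_j == 1 else (1.0 - p_j1)
    return p

x = [1, 0]                                    # one observation of the two features
scores = {y: joint(y, x) for y in prior}
prediction = max(scores, key=scores.get)      # class with the largest joint probability
print(scores, prediction)                     # {0: 0.036, 1: 0.224} -> predicts class 1
```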

What we need to estimate here are the conditional probabilities $p(x_j \mid y)$ and the marginal probability $p(y)$, and we use maximum likelihood for this. It is nicely explained on this blog and in the paper The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm by Michael Collins. For a general introduction to maximum likelihood estimation, check the Maximum Likelihood Estimation (MLE) in layman terms thread.

If you are dealing with binary or categorical variables, then they follow Bernoulli and categorical distributions, and the maximum likelihood estimators for the parameters of those distributions (the probabilities) are simply the empirical proportions. For example, if you observed a sequence of coin tosses $z_1,z_2,\dots,z_n$ that are independent and identically distributed according to a Bernoulli distribution with unknown probability of success $\theta$, and among those tosses you observed $k$ heads, then the "most likely" probability of observing heads given this data is $\hat \theta = \tfrac{k}{n}$. So if you previously observed that a third of the tosses were heads, then your best guess is that a third of the future tosses will also be heads. Here you can find the formal derivation. The same kind of computation is done for all the probabilities of interest, and then everything is multiplied (see the formulas above) to obtain the final result.
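For concreteness, here is a small Python sketch of those empirical-proportion estimates; the coin tosses and the labelled examples are invented toy data, used only to show that the coin estimate $\hat\theta = k/n$ and the naive Bayes tables $p(y)$ and $p(x_j \mid y)$ come from the same kind of counting.

```python
from collections import Counter

# A minimal sketch of the maximum likelihood estimates described above:
# for Bernoulli/categorical variables they are simply empirical proportions.
# All data below are made-up toy values.

# Coin example: k heads out of n tosses -> theta_hat = k / n
tosses = [1, 0, 0, 1, 0, 0]                     # 1 = heads
theta_hat = sum(tosses) / len(tosses)
print(theta_hat)                                # 0.333..., a third of the tosses were heads

# The naive Bayes tables come from the same kind of counting:
# labelled examples of the form ([x_1, x_2], y) with binary features
data = [([1, 0], 0), ([1, 1], 0), ([0, 1], 1), ([0, 0], 1), ([0, 1], 1)]

class_counts = Counter(y for _, y in data)
p_y = {y: c / len(data) for y, c in class_counts.items()}   # marginal p(y)

# Conditional p(x_j = 1 | y): proportion of class-y examples with feature j "on"
p_xj_given_y = {
    y: [sum(x[j] for x, label in data if label == y) / class_counts[y]
        for j in range(2)]
    for y in class_counts
}
print(p_y)            # {0: 0.4, 1: 0.6}
print(p_xj_given_y)   # {0: [1.0, 0.5], 1: [0.0, 0.666...]}
```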

All this said, if you want, you can estimate the naive Bayes model in the Bayesian paradigm by assuming priors for the probabilities and maximizing the posterior probability rather than the likelihood.
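As an illustration of that Bayesian variant (a sketch under the assumption of a symmetric Beta prior on each Bernoulli probability, not something spelled out above): the maximum a posteriori estimate is a smoothed proportion, and with a Beta(2, 2) prior it coincides with the familiar Laplace (add-one) smoothing.

```python
# A hedged sketch (toy counts, not from the answer): the Bayesian / MAP
# variant for a single Bernoulli probability. With a symmetric
# Beta(alpha, alpha) prior, the posterior mode is a smoothed proportion;
# alpha = 2 gives Laplace (add-one) smoothing instead of the raw MLE k / n.

def map_estimate(successes, trials, alpha=2.0):
    """Posterior mode of Beta(alpha + successes, alpha + trials - successes):
    (successes + alpha - 1) / (trials + 2 * (alpha - 1))."""
    return (successes + alpha - 1) / (trials + 2 * (alpha - 1))

print(map_estimate(0, 4))    # 0.166..., pulled away from the MLE of 0 / 4 = 0.0
print(map_estimate(2, 6))    # 0.375, close to the MLE of 2 / 6 = 0.333...
```

In a full naive Bayes model the same smoothing would be applied to every entry of $p(y)$ and $p(x_j \mid y)$.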

You can also find a detailed explanation of the naive Bayes algorithm with multiple examples on StackOverflow.