Solved – MLE and MAP with Naive Bayes

map-estimation, maximum-likelihood, naive-bayes

From what I understand, Naive Bayes classifies by doing:

$$
y \leftarrow \operatorname{argmax}_{y_k} P(Y=y_k)\prod_{i}P(X_i|Y=y_k)
$$

There are two things we need to know here: $P(Y=y_k)$ and all the $P(X_i|Y=y_k)$.
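
Just so it's clear what I mean, here is a minimal Python sketch of that decision rule, assuming the probabilities are already available somehow (the names `priors` and `likelihoods` are just mine, not from any library):

```python
# priors[k] = P(Y=y_k); likelihoods[k] = [P(X_1|Y=y_k), P(X_2|Y=y_k), ...]
def naive_bayes_predict(priors, likelihoods):
    def score(k):
        s = priors[k]
        for p in likelihoods[k]:
            s *= p  # multiply the class prior by each P(X_i | Y=y_k)
        return s
    return max(range(len(priors)), key=score)  # index of the argmax class y_k
```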

Now, for estimating the parameters of a probability distribution

  • MLE maximizes $P(D|\Theta)$
  • MAP maximizes $P(\Theta|D) = \frac{P(D|\Theta)P(\Theta)}{P(D)}$
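
(If I understand correctly, since $P(D)$ does not depend on $\Theta$, maximizing the posterior is the same as maximizing the likelihood times the prior:)

$$
\hat{\Theta}_{MAP} = \operatorname{argmax}_{\Theta} P(D|\Theta)\,P(\Theta)
$$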

The problem for me is bringing everything together.

From what I've read so far, both $P(Y=y_k)$ and all the $P(X_i|Y=y_k)$ are the parameters of the Naive Bayes model.

So these are the things we want to estimate using either MLE or MAP.

So far so good. But now I get confused:

What exactly is $P(D|\Theta)$ in this case? Is it $P(Y=y_k)\prod_{i}P(X_i|Y=y_k)$?

What confuses me is that the MLE objective $P(D|\Theta)$ kinda looks like $P(X_i|Y=y_k)$. But then I read that $P(Y=y_k)$ can also be estimated using MLE.

But the formula for MLE contains a conditional probability. How does $P(Y=y_k)$ fit into the picture?

For MAP, it kinda makes more sense to me, because we assume a prior for $P(Y=y_k)$.

Best Answer

Naive Bayes is not the best example here, hence your confusion. Naive Bayes is a whole classification algorithm: it tells us how to make classifications given the data by calculating the conditional probabilities and combining them under the naive assumption that the features are independent. I said that this is not the best example because the algorithm uses Bayes' theorem to combine the probabilities, so people sometimes jump to the conclusion that the algorithm has something in common with Bayesian statistics; it doesn't.

You are correct: in naive Bayes the probabilities are the parameters, so $P(Y=y_k)$ is a parameter, the same as all the $P(X_i|Y=y_k)$ probabilities. The standard, maximum likelihood, approach is to calculate the probabilities using MLE estimators, e.g. for $P(Y=y_k)$ you would count the cases where $Y=y_k$ and divide by the sample size; the same goes for the conditional probabilities, except that there you would count within subsets (among the cases where $Y=y_k$). A Bayesian, maximum a posteriori, estimate of such probabilities would require you to assume prior distributions for the parameters. A popular choice for binary data is the beta-binomial model, or Dirichlet-categorical for categorical variables, where we have closed-form solutions for the posterior distributions of the parameters.
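
To sketch what that looks like for categorical data (the function names and the smoothing parameter `alpha` are just illustrative, not from any particular library):

```python
from collections import Counter

def mle_prior(y, classes):
    """MLE of P(Y=y_k): count the cases where Y=y_k, divide by the sample size."""
    n = len(y)
    counts = Counter(y)
    return {c: counts[c] / n for c in classes}

def mle_conditional(x, y, values, y_k):
    """MLE of P(X=v | Y=y_k): count within the subset of cases where Y=y_k."""
    subset = [xi for xi, yi in zip(x, y) if yi == y_k]
    counts = Counter(subset)
    return {v: counts[v] / len(subset) for v in values}

def smoothed_prior(y, classes, alpha=1.0):
    """Posterior-mean estimate of P(Y=y_k) under a symmetric Dirichlet(alpha)
    prior, i.e. add-alpha smoothing: (count + alpha) / (n + K*alpha).
    The strict MAP estimate (posterior mode) would instead be
    (count + alpha - 1) / (n + K*(alpha - 1)), for alpha > 1."""
    n, K = len(y), len(classes)
    counts = Counter(y)
    return {c: (counts[c] + alpha) / (n + K * alpha) for c in classes}
```

With `alpha=1` the smoothed estimate is the familiar Laplace smoothing, and as the sample size grows it converges to the plain MLE counting estimate.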