Solved – Why is the naive bayes classifier optimal for 0-1 loss

bayesian, loss-functions, machine learning, naive bayes, optimization

The Naive Bayes classifier is the classifier which assigns an item $x$ to the class $C$ that maximizes the posterior probability of class membership $P(C|x)$, under the assumption that the features of the items are independent.

The 0-1 loss is the loss which assigns a loss of "1" to any misclassification and a loss of "0" to any correct classification.

I often read (1) that the Naive Bayes classifier is optimal for the 0-1 loss.
Why is this true?

(1) One exemplary source: Bayes classifier and Bayes error

Best Answer

Actually this is pretty simple: the Bayes classifier chooses the class that has the greatest a posteriori probability of occurrence (so-called maximum a posteriori estimation). The 0-1 loss function penalizes misclassification, i.e. it assigns the smallest loss to the solution that has the greatest number of correct classifications. So in both cases we are talking about estimating the mode. Recall that the mode is the most common value in the dataset, or the most probable value, so both maximizing the posterior probability and minimizing the 0-1 loss lead to estimating the mode.
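A quick numerical check can illustrate the equivalence. The sketch below uses a made-up three-class posterior (not from the question): the class that minimizes the expected 0-1 loss is exactly the class that maximizes the posterior, i.e. the posterior mode.

```python
import numpy as np

# Hypothetical posterior P(C | x) over three classes for a single item x.
posterior = np.array([0.2, 0.5, 0.3])

# Expected 0-1 loss of predicting class c:
#   sum over c* of 1[c != c*] * P(c* | x)  =  1 - P(c | x)
expected_loss = np.array([
    (posterior * (np.arange(len(posterior)) != c)).sum()
    for c in range(len(posterior))
])

print(expected_loss)             # [0.8 0.5 0.7], i.e. 1 - posterior
print(np.argmin(expected_loss))  # 1 -- class with the smallest expected loss
print(np.argmax(posterior))      # 1 -- the posterior mode, the same class
```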

If you need a formal proof, one is given in the Introduction to Bayesian Decision Theory paper by Angela J. Yu:

The 0-1 binary loss function has the following form:

$$ l_\boldsymbol{x}(\hat s, s^*) = 1 - \delta_{\hat ss^*} = \begin{cases} 1 & \text{if} \quad \hat s \ne s^* \\ 0 & \text{otherwise} \end{cases} $$

where $\delta$ is the Kronecker Delta function. (...) the expected loss is:

$$ \begin{align} \mathcal{L}_\boldsymbol{x}(\hat s) &= \sum_{s^*} l_\boldsymbol{x}(\hat s, s^*) \; P(s = s^* \mid \boldsymbol{x}) \\ &= \sum_{s^*} (1 - \delta_{\hat ss^*}) \; P(s = s^* \mid \boldsymbol{x}) \\ &= \sum_{s^*} P(s = s^* \mid \boldsymbol{x}) - \sum_{s^*} \delta_{\hat ss^*} \, P(s = s^* \mid \boldsymbol{x}) \\ &= 1 - P(s = \hat s \mid \boldsymbol{x}) \end{align} $$

This is true for maximum a posteriori estimation in general: minimizing the expected 0-1 loss is the same as maximizing the posterior probability. So if you know the posterior distribution, then assuming 0-1 loss, the optimal classification rule is to take the mode of the posterior distribution; we call this the optimal Bayes classifier. In real life we usually do not know the posterior distribution, but rather we estimate it. The Naive Bayes classifier approximates the optimal classifier by looking at the empirical distribution and by assuming independence of predictors. So the Naive Bayes classifier is not itself optimal, but it approximates the optimal solution. In your question you seem to confuse those two things.
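To make that distinction concrete, here is a minimal sketch of the approximation step. The prior and the class-conditional feature probabilities below are invented for illustration; in practice they would be estimated from data. Naive Bayes builds an approximate posterior by multiplying per-feature likelihoods as if they were independent, and then predicts its mode.

```python
import numpy as np

# Made-up estimates (purely illustrative, not from the question):
prior = {"spam": 0.4, "ham": 0.6}        # P(C)
likelihood = {                            # P(x_i | C), one entry per feature
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def naive_bayes_posterior(features, prior, likelihood):
    # Unnormalized scores P(C) * prod_i P(x_i | C), treating features as independent;
    # normalizing gives the naive Bayes approximation of P(C | x).
    scores = {c: prior[c] * np.prod([likelihood[c][f] for f in features])
              for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

post = naive_bayes_posterior(["free", "meeting"], prior, likelihood)
print(post)                        # approximate posterior under the independence assumption
print(max(post, key=post.get))     # predicted class = mode of the approximate posterior
```

If the independence assumption held exactly and the estimates were the true probabilities, this would coincide with the optimal Bayes classifier; otherwise it is only an approximation of it.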
