Bayes Classifier – Why the Bayes Classifier Is the Ideal Classifier

bayes, bayesian, classification, probability

The Bayes classifier is considered the ideal case, in which the probability structure underlying the categories is known perfectly.

Why is it that with the Bayes classifier we achieve the best performance that can be achieved?

What is the formal proof/explanation for this? We always use the Bayes classifier as a benchmark to compare the performance of all other classifiers.

Best Answer

Why is it that with the Bayes classifier we achieve the best performance that can be achieved? What is the formal proof/explanation for this?

Usually, a dataset $D$ is considered to consist of $n$ i.i.d. samples $x_i$ of a distribution that generates your data. Then, you build a predictive model from the given data: given a sample $x_i$, you predict the class $\hat{f}(x_i)$, whereas the real class of the sample is $f(x_i)$.

However, in theory, you could decide not to choose one particular model $\hat{f}_\text{chosen}$, but rather consider all possible models $\hat{f}$ at once and combine them somehow into one big model $\hat F$.

Of course, given the data, many of the smaller models could be quite improbable or inappropriate (for example, models that predict only one value of the target, even though there are multiple values of the target in your dataset $D$).

In any case, you want to predict the target value of new samples, which are drawn from the same distribution as the $x_i$. A good measure $e$ of the performance of your model would be $$e(\text{model}) = P[f(X) = \text{model}(X)]\text{,}$$ i.e., the probability that you predict the true target value for a randomly sampled $X$.
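As a quick illustration, here is a minimal sketch (the distribution, the true function $f$, and the model are all made up for the example) that estimates $e(\text{model})$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ Uniform(-1, 1), true target f(x) = 1[x > 0.2].
f = lambda x: (x > 0.2).astype(int)       # the (unknown) true function
model = lambda x: (x > 0.0).astype(int)   # some chosen model f_hat

# Monte Carlo estimate of e(model) = P[f(X) = model(X)]
x = rng.uniform(-1, 1, size=100_000)
print("estimated e(model):", np.mean(f(x) == model(x)))
# ~0.9: the model errs exactly when X falls in (0.0, 0.2]
```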

Using Bayes' formula, you can compute the probability that a new sample $x$ has target value $v$, given the data $D$:

$$P(v\mid D) = \sum_{\hat{f}} P(v\mid \hat{f}) P(\hat{f}\mid D)\text{.}$$ One should stress that

  • usually $P(v\mid \hat{f})$ is either $0$ or $1$, since $\hat{f}$ is a deterministic function of $x$,
  • not usually, but almost all the time, it is impossible to estimate $P(\hat{f}\mid D)$ (except in trivial cases),
  • not usually, but almost all the time, the number of possible models $\hat{f}$ is too large for the above sum to be evaluated.

Hence, it is very hard to obtain or estimate $P(v\mid D)$ in most cases.

Now, we proceed to the Optimal Bayes classifier. For a given $x$, it predicts the value $$\hat{v} = \text{argmax}_v \sum_{\hat{f}} P(v\mid \hat{f}) P(\hat{f}\mid D)\text{.}$$ Since, for each $x$, this is the most probable target value given the data, the Optimal Bayes classifier maximizes the probability of a correct prediction pointwise, and hence maximizes the performance measure $e$: no other classifier built on the same hypothesis space and priors can achieve a higher $e$ on average.
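To make this concrete, the following minimal sketch (the hypothesis space, noise model, prior, and data are all hypothetical) enumerates a tiny finite set of models, computes the posteriors $P(\hat{f}\mid D)$ via Bayes' rule, and predicts via the argmax above:

```python
import numpy as np

# Hypothetical finite hypothesis space: threshold classifiers on a scalar x.
hypotheses = [lambda x, t=t: int(x > t) for t in (-1.0, 0.0, 1.0)]

# Hypothetical training data D: (x, true label) pairs.
D = [(-2.0, 0), (-0.5, 0), (0.5, 1), (1.5, 1)]

def likelihood(f, data, eps=0.1):
    """P(D | f): assume each label agrees with f w.p. 1 - eps (an assumption)."""
    return np.prod([(1 - eps) if f(x) == y else eps for x, y in data])

# Posterior P(f | D) under a uniform prior over the hypothesis space.
post = np.array([likelihood(f, D) for f in hypotheses])
post /= post.sum()

def bayes_optimal_predict(x, values=(0, 1)):
    """argmax_v sum_f P(v | f) P(f | D), with P(v | f) in {0, 1} (f deterministic)."""
    scores = {v: sum(p for f, p in zip(hypotheses, post) if f(x) == v)
              for v in values}
    return max(scores, key=scores.get)

print(bayes_optimal_predict(0.3))  # combined prediction for a new sample
```

In this toy setting the sum runs over only three hypotheses, so it is tractable; the point made above is that in realistic hypothesis spaces it is not.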

We always use the Bayes classifier as a benchmark to compare the performance of all other classifiers.

Probably, you use the naive version of the Bayes classifier (naive Bayes). It is easy to implement and works reasonably well most of the time, but it computes only a naive estimate of $P(v\mid D)$, because it assumes the features are conditionally independent given the class.
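For comparison, a minimal naive Bayes baseline might look like this (using scikit-learn's GaussianNB on made-up data; the conditional-independence assumption is what makes the estimate "naive"):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Made-up 2-D data: two Gaussian blobs, one per class.
y = rng.integers(0, 2, size=200)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(200, 2))

clf = GaussianNB().fit(X, y)             # features assumed independent given class
print(clf.predict([[1.0, 1.0]]))         # predicted class for a new sample
print(clf.predict_proba([[1.0, 1.0]]))   # naive estimate of P(v | x, D)
```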
