Bayes Optimal Classifier – Defining a Bayes Optimal Classifier in Classification Tasks

bayesian, classification, decision-theory, machine-learning, supervised-learning

Say we are given a data distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is our input/feature space and $\mathcal{Y}$ the set of (discrete) labels. Hence, $D$ is a joint distribution over data points $x \in \mathcal{X}$ and their associated labels $y \in \mathcal{Y}$. We denote the associated random variables by $X$ and $Y$.

In machine learning, we often want to find a "good" predictor/classifier $f$ which assigns a label $y$ to a given data point $x$. To define what we mean by "good", one often uses the $0$-$1$ loss $\ell(y,f(x))$, which is $0$ if $y = f(x)$ and $1$ otherwise. One then wants to choose a function $f$ that minimizes the expected loss over the data distribution, also called the (true) risk, defined as $R(f) = \mathbb{E}_{(x,y)\sim D}\, \ell(y,f(x))$.
Now, it is easy to show that the optimal $f^*$, minimizing the true risk, is $f^*(x):=\text{argmax}_{y \in \mathcal{Y}} \mathbb{P}(Y=y \mid X=x)$. $f^*$ is called the Bayes (optimal) classifier.
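To make the definition concrete, here is a minimal sketch in Python (the joint distribution is made up purely for illustration): it enumerates all deterministic classifiers on a small discrete $\mathcal{X} \times \mathcal{Y}$ and checks that the argmax-of-the-posterior rule attains the smallest $0$-$1$ risk.

```python
import itertools

# Toy joint distribution D over X = {0, 1, 2} and Y = {0, 1} (made-up numbers).
# p_xy[x][y] = P(X = x, Y = y); the entries sum to 1.
p_xy = {0: {0: 0.30, 1: 0.10},
        1: {0: 0.05, 1: 0.25},
        2: {0: 0.12, 1: 0.18}}
X, Y = list(p_xy), [0, 1]

def risk(f):
    """True 0-1 risk R(f) = E_{(x,y)~D} [ 1{f(x) != y} ]."""
    return sum(p_xy[x][y] for x in X for y in Y if f[x] != y)

# Bayes classifier: f*(x) = argmax_y P(Y = y | X = x), which has the same
# argmax as the joint P(X = x, Y = y) since P(X = x) does not depend on y.
f_star = {x: max(Y, key=lambda y: p_xy[x][y]) for x in X}

# Enumerate all 2^3 deterministic classifiers and compare their risks.
all_classifiers = [dict(zip(X, labels)) for labels in itertools.product(Y, repeat=len(X))]
best_risk = min(risk(f) for f in all_classifiers)

print("R(f*) =", round(risk(f_star), 4))        # 0.27
print("min_f R(f) =", round(best_risk, 4))      # 0.27 -> f* is risk-minimizing
```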

My question now is simple: Is $f^*$ called the Bayes classifier because it $i)$ assigns to $x$ the most likely class, conditioned on $x$, or $ii)$ because it minimizes the risk?


Some thoughts / what I have researched so far to answer this question:

It is easy to see that, depending on how one defines the risk, the optimal classifier may not predict the most likely class, e.g., by associating different penalties with different class predictions $\hat{y}:=f(x)$; see the worked example below. Hence, $i)$ and $ii)$ are not equivalent.
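A small worked example of this (the numbers are mine, purely for illustration): suppose $\mathbb{P}(Y=1\mid x)=0.6$ and $\mathbb{P}(Y=0\mid x)=0.4$, and let the loss penalize predicting $1$ when the truth is $0$ ten times as heavily as the reverse, i.e. $\ell(0,1)=10$, $\ell(1,0)=1$, $\ell(y,y)=0$. The conditional risks of the two possible predictions are

$$
\mathbb{E}[\ell(Y,1)\mid x] = 0.4\cdot 10 = 4, \qquad \mathbb{E}[\ell(Y,0)\mid x] = 0.6\cdot 1 = 0.6,
$$

so the risk-minimizing prediction is $\hat{y}=0$, even though class $1$ is the more likely one.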

Regarding $ii)$:

I have delved a bit into statistical decision theory, in which I am by no means an expert, and hence hope for some help from this community. Works on decision theory define the Bayes decision rule as the rule which minimizes the Bayes risk. This is usually defined with respect to some parameter $\theta \in \Theta$, which we want to estimate based on data $x \in \mathcal{X}$ using a decision rule $\delta(x)$, and we assume we have a prior distribution $\pi$ over $\theta$. Then, the Bayes risk (with respect to some loss $\ell$) can be written as

$$
r(\delta) = \int_\Theta \int_\mathcal{X} \ell(\theta, \delta(x)) p(x|\theta) dx \pi(\theta) d\theta
$$

Now, if we associate $\theta$ with $y$ and $\delta$ with $f$, we can rewrite $r(\delta)$:

$$
\begin{aligned}
r(f) &= \int_\mathcal{Y} \int_\mathcal{X} \ell(y, f(x))\, p(x\mid y)\, dx\, \pi(y)\, dy \\
&= \int_\mathcal{Y} \int_\mathcal{X} \ell(y, f(x))\, p(x,y)\, dx\, dy \\
&= \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x))\, p(x,y)\, dx\, dy \\
&= \mathbb{E}_{(x,y)\sim D}\, \ell(y,f(x)) = R(f)
\end{aligned}
$$

The second line follows if the prior $\pi(y)$ is actually the true marginal distribution of $y$, i.e. $\pi(y)=\int_\mathcal{X} p(x,y)\, dx$, so that $p(x\mid y)\,\pi(y) = p(x,y)$. I write $\int_\mathcal{Y}$ for simplicity; if $\mathcal{Y}$ is discrete (as in the classification scenario above), the integral is a sum. The third line follows from interchanging the integrals, assuming the conditions of the Fubini-Tonelli theorem are satisfied. The fourth line, and hence the equivalence with the true risk in classification, follows by definition.
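As a numerical sanity check on this identity (again with made-up numbers, and with the prior taken to be the true marginal of $y$), here is a minimal sketch that computes $r(f)$ in the decision-theoretic form and $R(f)$ directly over the joint, and verifies they agree for every classifier $f$:

```python
import itertools

# Made-up discrete joint p(x, y) over X = {0, 1, 2}, Y = {0, 1}.
p_xy = {(0, 0): 0.30, (0, 1): 0.10,
        (1, 0): 0.05, (1, 1): 0.25,
        (2, 0): 0.12, (2, 1): 0.18}
X, Y = [0, 1, 2], [0, 1]

loss = lambda y, y_hat: float(y != y_hat)                    # 0-1 loss
pi = {y: sum(p_xy[x, y] for x in X) for y in Y}              # prior = true marginal of Y
p_x_given_y = {(x, y): p_xy[x, y] / pi[y] for x in X for y in Y}

def bayes_risk(f):
    """r(f) = sum_y pi(y) sum_x loss(y, f(x)) p(x | y)   (decision-theoretic form)."""
    return sum(pi[y] * sum(loss(y, f[x]) * p_x_given_y[x, y] for x in X) for y in Y)

def true_risk(f):
    """R(f) = E_{(x,y)~D} loss(y, f(x))   (joint-expectation form)."""
    return sum(loss(y, f[x]) * p_xy[x, y] for x in X for y in Y)

for labels in itertools.product(Y, repeat=len(X)):
    f = dict(zip(X, labels))
    assert abs(bayes_risk(f) - true_risk(f)) < 1e-12   # identical for every classifier
print("r(f) == R(f) for all 8 classifiers")
```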

Therefore, an $f$ minimizing the true risk $R(f)$ is equivalent to a decision rule $\delta$ minimizing the Bayes risk $r(\delta)$ with this specific prior; hence it is a Bayes optimal decision rule, and hence we call it a Bayes optimal classifier. Is this derivation/argument correct? It would mean that if I choose a different loss, the Bayes optimal classifier may not predict the most likely class. However, if $i)$ were true, then the Bayes classifier would simply no longer be an optimal classifier.

Regarding $i)$: Here, an argument could perhaps be made that the Bayes classifier gets its name because it is built from the posterior probability over the classes given the data.

Related posts, which my post is partly based on but which do not directly answer my question, are Mathematical definition of Bayes Optimality and Different definitions of Bayes risk.

Edit:
I found this work, which is the first and only ML reference I have found that clearly defines (and does not just state) the Bayes classifier, and which supports hypothesis $ii)$: they define the Bayes classifier as the classifier minimizing the expected loss, not necessarily as the one choosing the most likely class.

Best Answer

Interesting question; I will try to give an answer focused mainly on the history and terminology.

First off, that paper by Berner et al. you mention is far from being the "first and only ML reference" defining the Bayes classifier. In fact, in that very same paper, the authors cite the book Learning Theory: An Approximation Theory Viewpoint (2007) by Cucker and Zhou as a reference which defines the Bayes classifier.

In said book, the authors indeed define (in Proposition 9.3) the Bayes classifier (or Bayes rule) for the binary classification case ($\mathcal Y := \{-1,1\}$) as $$f^*(x) := \begin{cases}1 &\text{if }\ \mathbb P(Y=1\mid X=x)\ge\mathbb P(Y=-1\mid X=x)\\ -1 &\text{if }\ \mathbb P(Y=-1\mid X=x)>\mathbb P(Y=1\mid X=x)\end{cases}$$ (which is simply the binary version of the definition you gave), and they mention that they give it that name because it is a minimizer of the risk $\mathcal R$. Hence, for these authors, the answer to your question is $ii)$.

Going further back, the earliest reference (that I could find) which introduces the notions of Bayes risk and Bayes classifier is Introduction to Statistical Pattern Recognition (1972) by Keinosuke Fukunaga. In Section 3.1 of that book, Fukunaga introduces the problem of binary classification in the framework of Bayesian hypothesis testing:
there are two classes $\omega_1$ and $\omega_2$, and we assume that we know the class-conditional densities $p_i(X) = p(X \mid \omega_i)$ as well as the a priori (unconditional) probabilities $P_i = \mathbb P(\omega_i)$. We can then assign a class to an observation $X$ by using Bayes' theorem to compute the posterior probabilities $q_i(X)$, $$q_i(X) = \frac{P_i\, p_i(X)}{P_1\, p_1(X) + P_2\, p_2(X)},$$ and assigning to $X$ the index $i$ which maximizes that quantity. The author calls this procedure "performing a Bayes test for minimum error", where the error (called the Bayes error) is the expectation of the conditional error $r(X)$ defined as $$r(X) = \min\{q_1(X),\, q_2(X)\}.$$ Here, the quantity $r(X)$ represents the probability of assigning the wrong class to $X$. A quick computation shows that the Bayes error (what you would call the risk) is then given by $$\mathcal R \equiv \mathbb E[r(X)] = P_1\int_{\Gamma_2} p_1(x)\, dx + P_2\int_{\Gamma_1} p_2(x)\, dx,$$ where $\Gamma_i$ denotes the region of the observation space in which the test decides $\omega_i$.

Two things appear from this formulation: the Bayes error is by construction the expected 0-1 loss of the Bayes test $\arg\max_{i} q_i(X)$, and that classification rule is optimal by construction. This seems to say that "Bayes decision rule" is the name given to the classifier that minimizes the risk, and also the name of the classifier which assigns to $x$ the most likely class (so both $i)$ and $ii)$ are true).
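To make Fukunaga's construction concrete, here is a minimal sketch (a two-class toy with Gaussian class-conditional densities; the priors, means and variance are my own illustrative choices, not from the book). It computes the posteriors $q_i(X)$ and the conditional error $r(X)$, and checks that the Monte Carlo estimate of the Bayes error matches the error rate of the $\arg\max_i q_i(X)$ rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-class setup (made-up numbers): class-conditional densities
# p_i(X) are Gaussians, with a priori probabilities P_i.
P1, P2 = 0.4, 0.6
mu1, mu2, sigma = -1.0, 1.0, 1.0
gauss = lambda x, mu: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def posteriors(x):
    """q_i(X) = P_i p_i(X) / (P_1 p_1(X) + P_2 p_2(X)), by Bayes' theorem."""
    num1, num2 = P1 * gauss(x, mu1), P2 * gauss(x, mu2)
    z = num1 + num2
    return num1 / z, num2 / z

# Sample X from the mixture and compute the conditional error r(X) = min{q_1, q_2}.
labels = rng.random(100_000) < P2                       # True -> class omega_2
x = np.where(labels, rng.normal(mu2, sigma, labels.size),
                     rng.normal(mu1, sigma, labels.size))
q1, q2 = posteriors(x)

# Bayes error = E[r(X)]; it should agree with the error rate of argmax_i q_i(X).
bayes_error = np.minimum(q1, q2).mean()
argmax_error = np.mean((q2 > q1) != labels)
print(f"E[r(X)] ~ {bayes_error:.3f},  error of argmax rule ~ {argmax_error:.3f}")
```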

But in fact, a few pages later, Fukunaga introduces the Bayes decision rule for minimum cost. The conditional cost of deciding $\omega_i$ given the observation $X$ is $$r_i(X) = c_{i,1}\, q_1(X) + c_{i,2}\, q_2(X), \qquad r(X) = \min\{r_1(X),\, r_2(X)\},$$ where $c_{i,j}$ is the cost of deciding $\omega_i$ when the true class is $\omega_j$; the rule picks the decision with the smaller conditional cost and thereby minimizes the expected cost $\mathbb E[r(X)]$.
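Here is the same toy Gaussian setup extended to the minimum-cost rule (the cost values $c_{i,j}$ are again my own illustrative choices). It exhibits a point at which $\omega_1$ is the more likely class, yet the minimum-cost rule decides $\omega_2$, which is exactly the divergence between $i)$ and $ii)$ discussed above:

```python
import numpy as np

# Same illustrative two-class Gaussian setup as above.
P1, P2, mu1, mu2, sigma = 0.4, 0.6, -1.0, 1.0, 1.0
gauss = lambda x, mu: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def posteriors(x):
    num1, num2 = P1 * gauss(x, mu1), P2 * gauss(x, mu2)
    return num1 / (num1 + num2), num2 / (num1 + num2)

# c[i, j] = cost of deciding omega_i when the true class is omega_j
# (correct decisions cost 0; the values are illustrative).
c = {(1, 1): 0.0, (1, 2): 5.0,
     (2, 1): 1.0, (2, 2): 0.0}

def min_cost_decision(x):
    q1, q2 = posteriors(x)
    r1 = c[1, 1] * q1 + c[1, 2] * q2        # r_1(X): expected cost of deciding omega_1
    r2 = c[2, 1] * q1 + c[2, 2] * q2        # r_2(X): expected cost of deciding omega_2
    return 1 if r1 < r2 else 2

# At x = -0.4, omega_1 is the more likely class (q1 ~ 0.60), yet the
# minimum-cost rule decides omega_2 because missing omega_2 costs 5x more.
q1, q2 = posteriors(-0.4)
print(f"q1 = {q1:.2f}, q2 = {q2:.2f}, decision: omega_{min_cost_decision(-0.4)}")
```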

In that setting, it is true, as you noticed, that the optimal classifier need not be the one that assigns to $x$ its most likely class. The author still refers to the optimal classifier as the "Bayes decision rule", so we can conclude that, at least historically, "Bayes classifier" is the name given to the classifier that minimizes the risk, and the name comes from the fact that this classifier is derived by considering the problem from a Bayesian hypothesis-testing perspective.

Conclusion: $f^*$ is called the Bayes classifier because it is the minimizer of the risk (the minimal risk itself being historically referred to as the "Bayes error" in many classical texts, such as Devroye et al.'s A Probabilistic Theory of Pattern Recognition (1996) or Hastie, Tibshirani and Friedman's The Elements of Statistical Learning (2009)). In many cases it is still true that $f^*$ assigns to $x$ its most likely class, so while $ii)$ is always the defining property, $i)$ may coincide with it, as it does under the 0-1 loss.
