Solved – Optimality of the Bayes classifier

Tags: bayesian, classification, continuous data, proof

Let $X$ be a random variable with values in $\mathcal{X}$ and $Y$ a random variable with values in $\{0,1\}$. Is it true that
$$\mathbb{P}(g(X) \ne Y) = \int_\mathcal{X} \mathbb{P}\big(g(X)\ne Y \mid X=x\big)\, d\mathbb{P}_X(x)\,?$$

I agree with this equality in the case where $X$ is discrete (i.e. $\mathcal{X}$ is a countable set), but I can't prove why it holds in the general setting where $X$ is neither discrete nor has a density function. In fact, I can't prove the equality even in the case where $X$ has a density.

I came across this equality in at least two articles about statistical learning theory.

The statements, whose proofs use the equality above, say that the Bayes classifier achieves the minimum of the risk $\mathbb{P}(g(X)\ne Y)$ over all measurable functions $g$.
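For context (this sketch is not quoted from those articles, only reconstructed here): once the equality is granted, the usual argument compares conditional errors pointwise. Writing $\eta(x) = \mathbb{P}(Y=1 \mid X=x)$ and $g^*(x) = \mathbb{1}\{\eta(x) > 1/2\}$ for the Bayes classifier, one has for every $x$
\begin{align} \mathbb{P}(g(X) \ne Y \mid X = x) &= \eta(x)\,\mathbb{1}\{g(x)=0\} + \big(1-\eta(x)\big)\,\mathbb{1}\{g(x)=1\} \\ &\ge \min\big\{\eta(x),\, 1-\eta(x)\big\} = \mathbb{P}\big(g^*(X) \ne Y \mid X = x\big), \end{align}
and integrating both sides against $\mathbb{P}_X$ via the displayed equality gives $\mathbb{P}(g(X)\ne Y) \ge \mathbb{P}(g^*(X)\ne Y)$.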

Best Answer

There's no need to treat the cases where $X$ is discrete, continuous, or neither separately. We are just taking an iterated expectation here, which is always permissible. The idea is that, since probabilities are simply expectations of indicator random variables, we can always write something like

\begin{align}
P(g(X) \neq Y) &= \text{E} \left( I_{\{ g(X) \neq Y \}} \right) \\
&= \text{E} \left[ \text{E} \left( I_{\{ g(X) \neq Y \}} \mid X \right) \right] \\
&= \text{E} \left[ P(g(X) \neq Y \mid X) \right]
\end{align}

and the integral above is just another way of writing this down. Again, this holds true no matter what "type" of random variable $X$ happens to be.
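Not part of the original answer, but a quick Monte Carlo check makes the identity concrete. The model below is a hypothetical illustration (not taken from the post): $Y \sim \text{Bernoulli}(1/2)$, $X \mid Y=y \sim \mathcal{N}(2y-1,\,1)$, and the fixed classifier $g(x) = \mathbb{1}\{x > 0\}$. For this model $\eta(x) = \mathbb{P}(Y=1\mid X=x) = 1/(1+e^{-2x})$, so the conditional error $\mathbb{P}(g(X)\ne Y \mid X=x)$ can be written down exactly and then averaged over draws of $X$.

```python
import numpy as np

# Illustrative model (an assumption, not from the original post):
# Y ~ Bernoulli(1/2), X | Y = y ~ Normal(2y - 1, 1), g(x) = 1{x > 0}.
rng = np.random.default_rng(0)
n = 1_000_000

y = rng.integers(0, 2, size=n)              # labels in {0, 1}
x = rng.normal(2 * y - 1, 1.0, size=n)      # continuous features
g = (x > 0).astype(int)                     # a fixed measurable classifier

# Left-hand side: P(g(X) != Y), estimated directly.
lhs = np.mean(g != y)

# Right-hand side: E[ P(g(X) != Y | X) ].  In this model the regression
# function is eta(x) = P(Y = 1 | X = x) = 1 / (1 + exp(-2x)), so the
# conditional error of g is eta(x) where g(x) = 0 and 1 - eta(x) where
# g(x) = 1; averaging over the sampled x approximates the integral
# against the distribution of X.
eta = 1.0 / (1.0 + np.exp(-2.0 * x))
cond_err = np.where(g == 1, 1.0 - eta, eta)
rhs = np.mean(cond_err)

print(f"P(g(X) != Y)        ~ {lhs:.4f}")
print(f"E[P(g(X) != Y | X)] ~ {rhs:.4f}")   # agree up to Monte Carlo error
```

Both estimates land on the same value (roughly $0.16$ for this model), even though $X$ is continuous; nothing beyond the tower property is being used.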
