Solved – Bayes decision rule

bayesian, machine-learning

Assume binary classification, i.e. $y \in \{-1,1\}$, and that the underlying joint probability distribution generating the data is known, i.e. $P_{x,y}(x,y)$ is known.

I was told that the Bayes decision rule is the predictor you obtain by solving the following minimization problem with the indicator cost function (which indicates whether the prediction is right or wrong):

$$\min_{c \in \mathcal{H}} \mathbb{E}_{x,y}[\mathbb{1}_{\{c(x) \neq y \}}]$$

I was wondering what the resulting predictor $c^*$ from solving the above optimization problem is, and what its relation to the known distribution generating the data is, i.e. how $c^*$ relates to $P_{y|x}(1|x)$ and $P_{y|x}(-1|x)$.

What I had done so far was expand $ \mathbb{E}_{x,y}[\mathbb{1}_{\{c(x) \neq y \}}]$:

$\mathbb{E}_{x,y}[\mathbb{1}_{\{c(x) \neq y \}}] = \mathbb{E}_{x} \mathbb{E}_{y|x}[\mathbb{1}_{\{c(x) \neq y \}}]$

and then minimize the following:

$\mathbb{E}_{y|x}[\mathbb{1}_{\{c(x) \neq y \}}] = P_{y|x}(1 | x)\mathbb{1}_{\{c(x) \neq 1\}} + P_{y|x}(-1|x)\mathbb{1}_{\{ c(x) \neq -1\}}$

But I had a hard time moving on because I was unsure how to minimize the above expression.
Intuitively, I want to choose the predictor that makes my error the lowest, so I would choose the label $1$ or $-1$ depending on which has the higher probability of occurring. But I was having a hard time linking that intuition with the equation above in a precise or rigorous manner.
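To connect that intuition with the expression above, here is a minimal numerical sketch (purely illustrative, not part of the original question): for a few values of $p = P_{y|x}(1|x)$ it evaluates the conditional risk of predicting $+1$ and of predicting $-1$, and reports which choice is smaller.

```python
import numpy as np

# Conditional risk of predicting label c at a point x where P(y=1|x) = p:
#   E_{y|x}[1{c(x) != y}] = p * 1{c != 1} + (1 - p) * 1{c != -1}
def conditional_risk(c, p):
    return p * (c != 1) + (1 - p) * (c != -1)

for p in [0.1, 0.4, 0.5, 0.6, 0.9]:
    risk_plus, risk_minus = conditional_risk(1, p), conditional_risk(-1, p)
    best = 1 if risk_plus <= risk_minus else -1  # the pointwise minimizer
    print(f"P(y=1|x)={p:.1f}: risk(+1)={risk_plus:.1f}, "
          f"risk(-1)={risk_minus:.1f}, minimizer={best:+d}")
```

The printed minimizer is always the label with the larger conditional probability, which is exactly the intuition described above.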

What is the explicit function for $c^*(x)$?

Is the following function the correct one? If it is, why so?

$$ c^*(x) = \operatorname{sign}(2p_{y|x}(1|x) - 1)$$

Best Answer

Consider random variables $X$ and $Y$, where $Y \in \{+1,-1\}$. When the observation $X$ has value $x$, the decision rule $c(x)$, which takes on one of the two values $+1$ and $-1$, tells us what value the rule thinks $Y$ has taken on. The choice of decision function $c(x)$ effectively partitions the range of $X$ into two disjoint sets $\Gamma_{+1}$ and $\Gamma_{-1}$, that is, $c(x)$ can be expressed as $$c(x) = \begin{cases}+1, & x \in \Gamma_{+1},\\-1, & x \in \Gamma_{-1}. \end{cases} $$

The experiment is performed, resulting in $(X,Y)$ taking on value $(x,y)$, but we can only observe the value of $x$. We apply the function $c(x)$ to get our decision $+1$ or $-1$ as to what the value of $y$ is. A superior being (who knows everything, including the value of $y$ that has been hidden from us) then tells us whether we made a mistake or not: a mistake occurs when $y$ does not match the decision $c(x)$ that we reached.

Let $f_{-1}(x)$ and $f_{+1}(x)$ denote the conditional densities of $X$ given $Y=-1$ and $Y=+1$ respectively, and write $\pi_{-1} = P\{Y=-1\}$ and $\pi_{+1} = P\{Y=+1\}$ for the prior probabilities. Given that $Y=-1$, we make a mistake if the observed value of $X$ is in the region $\Gamma_{+1}$, and the conditional probability of error is thus $\displaystyle P(E\mid Y=-1) = \int_{\Gamma_{+1}} f_{-1}(x)\,\mathrm dx.$ Similarly, the conditional probability of error when $Y=+1$ is $\displaystyle P(E\mid Y=+1) = \int_{\Gamma_{-1}} f_{+1}(x)\,\mathrm dx.$ Hence, the unconditional probability of error $P(E)$ of this decision rule is $$\begin{align} P(E) &= P\{E \mid Y = -1\}P\{Y = -1\} + P\{E \mid Y = +1\}P\{Y = +1\}\\ &= \int_{\Gamma_{+1}}\pi_{-1}\cdot f_{-1}(x)\,\mathrm dx + \int_{\Gamma_{-1}}\pi_{+1}\cdot f_{+1}(x)\,\mathrm dx\\ &= \int_{\Gamma_{+1}}\pi_{-1}\cdot f_{-1}(x)\,\mathrm dx + \int_{\Gamma_{-1}}\pi_{+1}\cdot f_{+1}(x)\,\mathrm dx\\ &\quad + \int_{\Gamma_{-1}}\pi_{-1}\cdot f_{-1}(x)\,\mathrm dx - \int_{\Gamma_{-1}}\pi_{-1}\cdot f_{-1}(x)\,\mathrm dx\\ &= \pi_{-1} \int_{\mathbb R}f_{-1}(x)\,\mathrm dx + \int_{\Gamma_{-1}}\left[\pi_{+1}\cdot f_{+1}(x) - \pi_{-1}\cdot f_{-1}(x)\right]\,\mathrm dx\\ &= \pi_{-1} + \int_{\Gamma_{-1}}\left[\pi_{+1}\cdot f_{+1}(x) - \pi_{-1}\cdot f_{-1}(x)\right]\,\mathrm dx,\tag{1} \end{align}$$ where the third equality just adds and subtracts the same term, and the fourth combines the two $\pi_{-1}f_{-1}$ integrals over $\Gamma_{+1}$ and $\Gamma_{-1}$ into a single integral over all of $\mathbb R$.
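To make the decomposition concrete, here is an illustrative sketch. It assumes, purely for the example, Gaussian class-conditional densities $f_{\pm 1}$, priors $\pi_{\pm 1}$, and a simple threshold rule $\Gamma_{-1} = (-\infty, t]$ (none of which is required by the argument), and computes $P(E)$ both from the two conditional-error integrals and by Monte Carlo simulation.

```python
import numpy as np
from scipy.stats import norm

# Illustrative model (an assumption for this sketch only):
# f_{-1} = N(-1, 1), f_{+1} = N(+1, 1), priors pi_{-1} = 0.3, pi_{+1} = 0.7.
pi_m, pi_p = 0.3, 0.7
f_m = norm(loc=-1.0, scale=1.0)
f_p = norm(loc=+1.0, scale=1.0)

# A (not necessarily optimal) threshold rule: decide -1 on Gamma_{-1} = (-inf, t].
t = 0.0

# P(E) = pi_{-1} * int_{Gamma_{+1}} f_{-1} dx + pi_{+1} * int_{Gamma_{-1}} f_{+1} dx
p_err_integral = pi_m * (1 - f_m.cdf(t)) + pi_p * f_p.cdf(t)

# Monte Carlo check of the same quantity.
rng = np.random.default_rng(0)
n = 200_000
y = rng.choice([-1, 1], size=n, p=[pi_m, pi_p])
x = np.where(y == -1, f_m.rvs(n, random_state=rng), f_p.rvs(n, random_state=rng))
decision = np.where(x <= t, -1, 1)
p_err_mc = np.mean(decision != y)

print(f"P(E) via integrals:   {p_err_integral:.4f}")
print(f"P(E) via Monte Carlo: {p_err_mc:.4f}")
```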

The Bayesian decision rule is the rule which minimizes the right side of $(1)$. We can do nothing with the first term which is the same for all decision rules, but by clever choice of the region $\Gamma_{-1}$ (the decision rule is effectively defined by the region $\Gamma_{-1}$), we can make $P(E)$ smaller. Note that the integrand in $(1)$ can be positive or negative, and by choosing $$\Gamma_{-1} = \{x \colon \pi_{+1}\cdot f_{+1}(x) - \pi_{-1}\cdot f_{-1}(x) \leq 0\}, \tag{2}$$ (thus excluding from $\Gamma_{-1}$ all points $x$ for which $\pi_{+1}\cdot f_{+1}(x) - \pi_{-1}\cdot f_{-1}(x) > 0$), we make sure that the integrand is never positive in the range of integration, and so the integral has as negative a value as possible. Hence, the decision rule described in $(2)$ minimizes $P(E)$, and is the Bayesian decision rule.
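Continuing the same illustrative Gaussian setup (again, an assumption made only for the example), the region in $(2)$ can be evaluated pointwise and its error compared with that of perturbed regions; up to Monte Carlo noise, the Bayes choice should never do worse.

```python
import numpy as np
from scipy.stats import norm

# Same illustrative model as in the previous sketch (assumption for the example only).
pi_m, pi_p = 0.3, 0.7
f_m = norm(loc=-1.0, scale=1.0)
f_p = norm(loc=+1.0, scale=1.0)

def decide(x, shift=0.0):
    """Decide -1 where pi_{+1} f_{+1}(x) - pi_{-1} f_{-1}(x) <= shift.

    shift = 0 gives rule (2); nonzero shifts give competing (suboptimal) rules."""
    return np.where(pi_p * f_p.pdf(x) - pi_m * f_m.pdf(x) <= shift, -1, 1)

rng = np.random.default_rng(1)
n = 500_000
y = rng.choice([-1, 1], size=n, p=[pi_m, pi_p])
x = np.where(y == -1, f_m.rvs(n, random_state=rng), f_p.rvs(n, random_state=rng))

for shift in [0.0, -0.1, 0.1, 0.2]:
    err = np.mean(decide(x, shift) != y)
    label = "Bayes rule (2)" if shift == 0.0 else f"shifted by {shift:+.2f}"
    print(f"{label:>18s}: estimated P(E) = {err:.4f}")
```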


So how does all this play out in terms of posterior distributions? The posterior distribution of $Y$ given $X$ is discrete, and the Bayesian decision rule works out to choosing whichever value of $Y$ has the greater posterior probability. In fact, we have that $$\begin{align} P\{Y=+1\mid X = x\} &= \frac{\pi_{+1}f_{+1}(x)}{\pi_{+1}\cdot f_{+1}(x) + \pi_{-1}\cdot f_{-1}(x)}\tag{3}\\ P\{Y=-1\mid X = x\} &= \frac{\pi_{-1}f_{-1}(x)}{\pi_{+1}\cdot f_{+1}(x) + \pi_{-1}\cdot f_{-1}(x)}\tag{4} \end{align}$$ and so, choosing whichever posterior probability is larger gives the same decision rule as $(2)$. Now, if $P\{Y=+1\mid X = x\}$ (which is $p_{y|x}(1|x)$ in the OP's notation) is larger than $P\{Y=-1\mid X = x\}$, then it is true that $p_{y|x}(1|x) > \frac 12$, and so $\operatorname{sgn}(2p_{y|x}(1|x) -1) = +1$, and so

Yes, the Bayes decision rule $c^*(x)$ can be expressed as $\operatorname{sgn}(2p_{y|x}(1|x) -1)$.

However, the fact that this choice minimizes $P(E)$ is a lot harder to see from $(3)$ and $(4)$ or from the succinct expression $\operatorname{sgn}(2p_{y|x}(1|x) -1)$ than from the development that led to $(2)$. Or at least, that is how I, a non-statistician, perceive the matter; your mileage may vary.
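For completeness, a short sketch (same illustrative Gaussian model as above, which is only an assumption for the example) computes the posteriors $(3)$ and $(4)$ on a grid and confirms that choosing the larger posterior, i.e. $\operatorname{sgn}(2p_{y|x}(1|x)-1)$, reproduces rule $(2)$.

```python
import numpy as np
from scipy.stats import norm

# Illustrative model only: Gaussian class-conditional densities with priors pi_{+-1}.
pi_m, pi_p = 0.3, 0.7
f_m = norm(loc=-1.0, scale=1.0)
f_p = norm(loc=+1.0, scale=1.0)

x = np.linspace(-4, 4, 2001)

# Posteriors (3) and (4) via Bayes' theorem.
num_p = pi_p * f_p.pdf(x)
num_m = pi_m * f_m.pdf(x)
post_p = num_p / (num_p + num_m)        # P(Y = +1 | X = x)

# Rule from the posterior: sgn(2 P(Y=+1|x) - 1).
rule_posterior = np.sign(2 * post_p - 1)

# Rule (2): decide -1 exactly where pi_{+1} f_{+1}(x) - pi_{-1} f_{-1}(x) <= 0.
rule_region = np.where(num_p - num_m <= 0, -1, 1)

# The two rules can only differ on the tie set where the posteriors are equal.
print("rules agree off ties:",
      np.all((rule_posterior == rule_region) | (rule_posterior == 0)))
```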
