Understand the “Naive Bayes” Assumption


I am trying to understand the Wikipedia page concerning the Naive Bayes classifier used in Machine Learning here.

Suppose we have a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. If we wish to know the probability that this vector belongs to some class $C$, then we are interested in the conditional probability $p(C|x) = \frac{p(x|C)p(C)}{p(x)}$ (the RHS by Bayes' rule). The author writes that the numerator is equivalent to the joint probability model:

$p(C,x) = p(x_1,\ldots,x_n,C) = p(x_1|x_2,\ldots,x_n,C)\, p(x_2|x_3,\ldots,x_n,C) \cdots p(x_{n-1}|x_n,C)\, p(x_n|C)\, p(C)$

(by iteration of the chain rule for probability).
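For concreteness (a small worked case of my own, taking $n = 3$), the iteration reads

$$p(C, x_1, x_2, x_3) = p(x_1 \mid x_2, x_3, C)\, p(x_2 \mid x_3, C)\, p(x_3 \mid C)\, p(C),$$

each step peeling off one more variable via $p(U, V) = p(U \mid V)\, p(V)$.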

Next the "naive" conditional independence assumption comes into play: all the features of $x$ (its components, each treated as a random variable) are assumed to be mutually independent, conditional on the class $C$. Under this assumption, $p(x_i|x_{i+1},\ldots,x_n,C) = p(x_i|C)$.
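Applying this to every factor in the chain-rule expansion above collapses the joint model to the factorization the article is building toward:

$$p(C \mid x_1, \ldots, x_n) \propto p(C) \prod_{i=1}^{n} p(x_i \mid C).$$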

This last statement is where I am confused. I understand that if two random events $A,B$ are independent then $p(A\cap B)= p(A)p(B)$, and intuitively if we have three events $A,B,C$, then the probability that $A$ occurs given $B,C$ where $A$ and $B$ are independent should only depend on $A,C$, but I don't get how this is arrived at mathematically. In this scenario…
$P(A|B,C) = \frac{P(A \cap B \cap C)}{P(B \cap C)}$, and in general, by inclusion-exclusion,

$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$

so that, using $P(A \cap B) = P(A)P(B)$, $P(A \cap B \cap C) = P(A \cup B \cup C) - P(A) - P(B) - P(C) + P(A)P(B) + P(A \cap C) + P(B \cap C)$

Any insights appreciated.

Best Answer

It's not necessarily the case that $A,B$ independent implies $P(A | B,C) = P(A | C)$. Consider, for example, two random variables $X,Y$ that are independent and each take value $1$ with probability $1/2$ and value $-1$ with probability $1/2$. Let $A = \{X = 1\}, B = \{Y=1\}, C = \{X+Y = 0\}$. Then $A$ and $B$ are independent, $P(A | B,C) = 0$, and $P(A | C) = \frac{1}{2}$.
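If it helps to see this concretely, here is a minimal sketch (plain Python, enumerating the four equally likely outcomes of $(X, Y)$) that checks the three claims above; the event names mirror the definitions of $A$, $B$, $C$:

```python
from itertools import product
from fractions import Fraction

# The four equally likely outcomes of (X, Y), each with probability 1/4.
outcomes = list(product([1, -1], repeat=2))

def prob(event):
    """Probability of an event (a predicate on (x, y)) under the uniform distribution."""
    return Fraction(sum(1 for x, y in outcomes if event(x, y)), len(outcomes))

A = lambda x, y: x == 1          # A = {X = 1}
B = lambda x, y: y == 1          # B = {Y = 1}
C = lambda x, y: x + y == 0      # C = {X + Y = 0}

def both(e1, e2):
    return lambda x, y: e1(x, y) and e2(x, y)

# A and B are independent: P(A ∩ B) = P(A) P(B) = 1/4.
assert prob(both(A, B)) == prob(A) * prob(B)

# P(A | B, C) = P(A ∩ B ∩ C) / P(B ∩ C) = 0 / (1/4) = 0.
print(prob(both(both(A, B), C)) / prob(both(B, C)))   # prints 0

# P(A | C) = P(A ∩ C) / P(C) = (1/4) / (1/2) = 1/2.
print(prob(both(A, C)) / prob(C))                      # prints 1/2
```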

To deduce $P(A | B,C) = P(A|C)$, you need to know that $C$ does not provide a "link" between the independent events $A$ and $B$. The way to formalize a lack of link is to say that $A,B$ are conditionally independent given $C$, i.e., that $P(A,B|C) = P(A|C)P(B|C)$. Note that the equality implies \begin{align*} P(A|B,C) &= \frac{P(A,B,C)}{P(B,C)} \\ &= \frac{P(A,B,C)}{P(C)}\frac{P(C)}{P(B,C)} \\ &= P(A,B|C)\frac{1}{P(B|C)} \\ &= P(A|C)P(B|C)\frac{1}{P(B|C)} \\ &= P(A|C), \end{align*} which is what is used in the machine learning setting.
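In that setting, the conditional independence assumption is exactly what lets a naive Bayes classifier score a class by multiplying per-feature likelihoods. A minimal sketch of that computation, with made-up priors and likelihoods (the class names and numbers are purely hypothetical):

```python
# Hypothetical priors p(C) and per-feature likelihoods p(x_i | C) for one
# observed feature vector x = (x_1, x_2, x_3); the numbers are made up.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.3, 0.5],   # p(x_1|spam), p(x_2|spam), p(x_3|spam)
    "ham":  [0.1, 0.6, 0.4],   # p(x_1|ham),  p(x_2|ham),  p(x_3|ham)
}

# Unnormalized posterior score for each class: p(C) * prod_i p(x_i | C).
scores = {}
for c, prior in priors.items():
    score = prior
    for p in likelihoods[c]:
        score *= p
    scores[c] = score

# Normalizing by the evidence p(x) = sum_C p(C) * prod_i p(x_i | C)
# recovers the posterior p(C | x) under the naive assumption.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}
print(posterior)
```

The final normalization plays the role of dividing by $p(x)$; in practice one usually sums logarithms of these factors instead of multiplying them, to avoid numerical underflow.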
