[Math] Bayes spam filtering

bayesian probability

To analyze the words that appear in spam emails, you collect a sample of 1000 emails marked as spam and 1000 emails marked as non-spam. Of the 1000 spam emails, 210 contained the phrase "This isn't spam", 99 contained the word "urgent", and 110 contained the word "guarantee". Of the 99 that contained the word "urgent", 79 also contained the word "guarantee".

Of the 1000 non-spam emails, 23 contained the phrase "This isn't spam", 80 contained the word "urgent", and 48 contained the word "guarantee". Of the 80 that contained the word "urgent", 8 also contained the word "guarantee".

Assuming that the a priori probability of any message being spam is 0.5, what is the probability that an email is spam given each of the following conditions? Give your answers rounded to the nearest integer.

  1. it contains the phrase This isn't spam?
  2. it contains the word guarantee?
  3. it contains both the words urgent and guarantee?
  4. it contains the word urgent but not the word guarantee?

My approach is to first list all the quantities we are given:

P(spam) = 0.5
P(non-spam) = 0.5
P(This isn't spam | spam) = 210/1000
P(urgent | spam) = 99/1000
P(guarantee | spam) = 110/1000
P(urgent and guarantee | spam) = 79/1000
P(This isn't spam | non-spam) = 23/1000
P(urgent | non-spam) = 80/1000
P(guarantee | non-spam) = 48/1000
P(urgent and guarantee | non-spam) = 8/1000

So the questions translate to

  1. P(spam | This isn't spam) = ?
  2. P(spam | guarantee) = ?
  3. P(spam | urgent and guarantee) = ?
  4. P(spam | urgent and not guarantee) = ?

Best Answer

For easier notation, define events:

\begin{eqnarray*} S &=& \text{"the email is spam"} \\ I &=& \text{"the email contains the phrase 'this isn't spam'"} \\ G &=& \text{"the email contains the word 'guarantee'"} \\ U &=& \text{"the email contains the word 'urgent'".} \\ \end{eqnarray*}

Using Bayes' Theorem:

$$P(S\mid I) = \dfrac{P(I\mid S)P(S)}{P(I\mid S)P(S) + P(I\mid S^c)P(S^c)} = \dfrac{\frac{210}{1000}\frac{1}{2}}{\frac{210}{1000}\frac{1}{2} + \frac{23}{1000}\frac{1}{2}} = \dfrac{210}{233}.$$

$$P(S\mid G) = \dfrac{P(G\mid S)P(S)}{P(G\mid S)P(S) + P(G\mid S^c)P(S^c)} = \dfrac{\frac{110}{1000}\frac{1}{2}}{\frac{110}{1000}\frac{1}{2} + \frac{48}{1000}\frac{1}{2}} = \dfrac{110}{158}.$$

$$P(S\mid U\cap G) = \dfrac{P(U\cap G\mid S)P(S)}{P(U\cap G\mid S)P(S) + P(U\cap G\mid S^c)P(S^c)} = \dfrac{\frac{79}{1000}\frac{1}{2}}{\frac{79}{1000}\frac{1}{2} + \frac{8}{1000}\frac{1}{2}} = \dfrac{79}{87}.$$

$$P(S\mid U\cap G^c) = \dfrac{P(U\cap G^c\mid S)P(S)}{P(U\cap G^c\mid S)P(S) + P(U\cap G^c\mid S^c)P(S^c)} = \dfrac{\frac{99-79}{1000}\frac{1}{2}}{\frac{99-79}{1000}\frac{1}{2} + \frac{80-8}{1000}\frac{1}{2}} = \dfrac{20}{92}.$$
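As a quick numerical check of the four fractions above, here is a minimal Python sketch. The function name `posterior_spam` and its parameters are my own naming, not part of the original problem. It assumes, as the formulas above do, that the sample counts are used directly as estimates of the conditional probabilities. Note that `Fraction` reduces results automatically, so $110/158$ prints as `55/79` and $20/92$ as `5/23`.

```python
from fractions import Fraction


def posterior_spam(k_spam, k_ham, n_spam=1000, n_ham=1000,
                   prior_spam=Fraction(1, 2)):
    """Return P(spam | feature) by Bayes' theorem, computed from raw counts.

    k_spam / n_spam estimates P(feature | spam),
    k_ham / n_ham estimates P(feature | non-spam).
    """
    like_spam = Fraction(k_spam, n_spam)
    like_ham = Fraction(k_ham, n_ham)
    numerator = like_spam * prior_spam
    return numerator / (numerator + like_ham * (1 - prior_spam))


# (count among spam, count among non-spam) for each condition.
cases = {
    "I": (210, 23),
    "G": (110, 48),
    "U and G": (79, 8),
    # "urgent but not guarantee": subtract the emails containing both words.
    "U and not G": (99 - 79, 80 - 8),
}

results = {name: posterior_spam(s, h) for name, (s, h) in cases.items()}

for name, p in results.items():
    print(f"P(S | {name}) = {p} (about {float(p):.3f})")
```

Because the prior is $1/2$ and both samples contain 1000 emails, every posterior reduces to $a/(a+b)$, where $a$ and $b$ are the spam and non-spam counts. In decimals the four answers are roughly $0.901$, $0.696$, $0.908$, and $0.217$, i.e. about 90%, 70%, 91%, and 22%.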
