Solved – Machine learning techniques for spam detection, and in general for text classification

dimensionality reduction, machine learning, naive bayes, sparse, text mining

I am going to build a system for spam detection. What I have is a dataset of labeled (spam/not-spam) strings, consisting mostly of sentences.

I have a background in machine learning techniques, but no background in machine learning applied to text.

One approach could be to create vectors of extremely high dimension, with boolean features, where each feature represents whether one possible word is present or not.

Of course such an approach is unsatisfactory, not only because of the high dimensionality, but also because of the extreme sparsity: each feature is active in very few instances, and each instance activates very few features.
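For what it's worth, here is a minimal sketch of that boolean per-word encoding using scikit-learn (the library choice and the toy sentences are my own, purely for illustration); note that the vectorizer already stores the result as a sparse matrix, which mitigates the memory side of the problem:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy labeled sentences (hypothetical examples, not from a real dataset)
sentences = ["win a free prize now", "meeting rescheduled to monday morning"]

vectorizer = CountVectorizer(binary=True)   # 1 = word present, 0 = word absent
X = vectorizer.fit_transform(sentences)     # scipy sparse matrix, one column per vocabulary word

print(X.shape)                              # (2, n_vocabulary_words)
print(X.toarray())                          # dense view of the boolean features
```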

What I am asking for is just a few pointers (to tutorials, for example) to simple, entry-level techniques that address the aforementioned shortcomings of the boolean per-word encoding.

Any ideas on which tools may be more suitable for this task? Maybe RapidMiner?

Best Answer

One basic technique is Naive Bayes, which is often used for spam filtering. Essentially, you look at the frequencies of words appearing in sentences that you have already judged to be spam, and also at the frequencies of those words appearing in sentences you have already judged to be not spam, and then use those frequencies to make a judgement about new sentences.

You first estimate the conditional probabilities:

  • feature on (1) or off (0) given spam, $\newcommand{\bm}[1]{\mathbf{#1}}P(\bm{f}_i=1|S), P(\bm{f}_i=0|S)$ for all features $\bm{f}_i$, and
  • feature on or off given not spam, $P(\bm{f}_i=1|\neg S), P(\bm{f}_i=0|\neg S)$.

These can be estimated from the training set by summing the columns for the spam training examples and dividing by the number of spam examples, and likewise for the non-spam examples. You also need to estimate the incidence of spam: the prior probabilities for spam $P(S)$ and not spam $P(\neg S)$.
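As a concrete sketch of this estimation step (assuming a dense NumPy 0/1 document-term matrix `X` and a label vector `y` with 1 for spam; these names are hypothetical, not from the question):

```python
import numpy as np

def estimate_params(X, y):
    """Estimate P(f_i = 1 | S), P(f_i = 1 | not S) and the prior P(S)."""
    spam = X[y == 1]                                     # rows judged spam
    ham = X[y == 0]                                      # rows judged not spam
    p_f_given_spam = spam.sum(axis=0) / spam.shape[0]    # column sums / number of spam examples
    p_f_given_ham = ham.sum(axis=0) / ham.shape[0]       # column sums / number of non-spam examples
    p_spam = spam.shape[0] / X.shape[0]                  # prior P(S); P(not S) = 1 - p_spam
    return p_f_given_spam, p_f_given_ham, p_spam
```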

Given a new example with a feature vector $\bm{f}$, you can use Bayes rule and the naive assumption of independence of feature probabilities to write $$P(S|\bm{f})=\frac{P(\bm{f}|S)P(S)}{P(\bm{f})} = \frac{P(\bm{f}_1|S)P(\bm{f}_2|S)\ldots P(\bm{f}_n|S)\;P(S)}{P(\bm{f})}=\frac{\left(\prod P(\bm{f}_i|S)\right)\;P(S)}{P(\bm{f})}$$ Similarly $P(\neg S|\bm{f}) = (\prod P(\bm{f}_i|\neg S))P(\neg S)\;/P(\bm{f})$. So you can get the probabilities $P(S|\bm{f})$ and $P(\neg S|\bm{f})$ based on the probabilities estimated earlier. You decide whether the new example is spam or not spam based on which probability is higher, which means that you can drop the denominator $P(\bm{f})$.
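Continuing the same hypothetical sketch, the decision rule for a new binary feature vector `f` is then a product of the relevant conditionals times the class prior, with $P(\bm{f})$ dropped:

```python
import numpy as np

def classify(f, p_f_given_spam, p_f_given_ham, p_spam):
    """Return True if the feature vector f is judged spam."""
    # Use P(f_i = 1 | class) where f_i is on, and 1 - P(f_i = 1 | class) where it is off
    spam_score = np.prod(np.where(f == 1, p_f_given_spam, 1 - p_f_given_spam)) * p_spam
    ham_score = np.prod(np.where(f == 1, p_f_given_ham, 1 - p_f_given_ham)) * (1 - p_spam)
    return spam_score >= ham_score    # P(f) cancels, so comparing numerators is enough
```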

This is easy to implement in RapidMiner, from the looks of a quick web search. It's also very easy to implement from scratch: you're talking low tens of lines of code in a language like R or Matlab. If you do so, it's worth noting that when an estimated conditional probability is zero, you need to pick some small non-zero value instead so that the product of conditional probabilities doesn't collapse to zero. (It's also worth replacing the products with sums of logarithms to avoid numerical underflow.)
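As a sketch of those two fixes (Laplace-style smoothing so no estimated probability is exactly zero or one, and sums of logs instead of products; again with hypothetical names, continuing the earlier snippets):

```python
import numpy as np

def estimate_params_smoothed(X, y, eps=1.0):
    """Like estimate_params, but with eps pseudo-counts so no probability is 0 or 1."""
    spam, ham = X[y == 1], X[y == 0]
    p_f_given_spam = (spam.sum(axis=0) + eps) / (spam.shape[0] + 2 * eps)
    p_f_given_ham = (ham.sum(axis=0) + eps) / (ham.shape[0] + 2 * eps)
    p_spam = spam.shape[0] / X.shape[0]
    return p_f_given_spam, p_f_given_ham, p_spam

def classify_log(f, p_f_given_spam, p_f_given_ham, p_spam):
    """Same decision rule as before, but summing logs to avoid numerical underflow."""
    log_spam = np.sum(np.log(np.where(f == 1, p_f_given_spam, 1 - p_f_given_spam))) + np.log(p_spam)
    log_ham = np.sum(np.log(np.where(f == 1, p_f_given_ham, 1 - p_f_given_ham))) + np.log(1 - p_spam)
    return log_spam >= log_ham
```

If a ready-made implementation is acceptable, scikit-learn's `BernoulliNB` does essentially this (binary features, smoothing, log-space computation) out of the box.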

Naive Bayes is a very simple technique, but it has some drawbacks. One is how to deal with conditional probabilities that are zero, as discussed above. Perhaps more seriously, Naive Bayes as described here pays no attention to word order or context (words occurring together can have quite a different meaning than they do occurring singly). A good way to find information on more sophisticated techniques is to read up on sentiment analysis.
