Solved – Machine learning techniques for spam detection, and in general for text classification

dimensionality reduction, machine learning, naive bayes, sparse, text mining

I am going to build a system for spam detection. What I have is a dataset of labeled (spam/not-spam) strings, consisting mostly of sentences.

I have a background in machine learning techniques, but no background in machine learning applied to text.

One approach could be to create vectors of extremely high dimension, with boolean features, where each feature represents whether one possible word is present or not.

Of course such an approach is unsatisfactory, not only because of the high dimensionality, but also because of the extreme sparsity: each feature is active in very few instances, and each instance activates very few features.
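For what it's worth, here is a minimal sketch of that boolean per-word encoding using scikit-learn (the library choice and the toy sentences are my own, purely for illustration); note that the vectorizer already stores the result as a sparse matrix, which mitigates the memory side of the problem:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy labeled sentences (hypothetical examples, not from a real dataset)
sentences = ["win a free prize now", "meeting rescheduled to monday morning"]

vectorizer = CountVectorizer(binary=True)   # 1 = word present, 0 = word absent
X = vectorizer.fit_transform(sentences)     # scipy sparse matrix, one column per vocabulary word

print(X.shape)                              # (2, n_vocabulary_words)
print(X.toarray())                          # dense view of the boolean features
```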

What I am asking for is just a few pointers (to tutorials, for example) to simple, entry-level techniques that address the aforementioned shortcomings of the boolean per-word encoding.

Any ideas on which tools may be more suitable for this task? Maybe RapidMiner?

Best Answer

One basic technique is Naive Bayes, which is often used for spam filtering. Essentially, you look at the frequencies of words appearing in sentences that you have already judged to be spam, and also at the frequencies of those words appearing in sentences you have already judged to be not spam, and then use those frequencies to make a judgement about new sentences.

You first estimate the conditional probabilities:

  • feature on (1) or off (0) given spam, $\newcommand{\bm}[1]{\mathbf{#1}}P(\bm{f}_i=1|S), P(\bm{f}_i=0|S)$ for all features $\bm{f}_i$, and
  • feature on or off given not spam, $P(\bm{f}_i=1|\neg S), P(\bm{f}_i=0|\neg S)$.

These can be estimated from the training set by summing the columns for the spam training examples and dividing by the number of spam examples, and likewise for the non-spam examples. You also need to estimate the incidence of spam: the prior probabilities for spam $P(S)$ and not spam $P(\neg S)$.
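As a concrete sketch of this estimation step (assuming a dense NumPy 0/1 document-term matrix `X` and a label vector `y` with 1 for spam; these names are hypothetical, not from the question):

```python
import numpy as np

def estimate_params(X, y):
    """Estimate P(f_i = 1 | S), P(f_i = 1 | not S) and the prior P(S)."""
    spam = X[y == 1]                                     # rows judged spam
    ham = X[y == 0]                                      # rows judged not spam
    p_f_given_spam = spam.sum(axis=0) / spam.shape[0]    # column sums / number of spam examples
    p_f_given_ham = ham.sum(axis=0) / ham.shape[0]       # column sums / number of non-spam examples
    p_spam = spam.shape[0] / X.shape[0]                  # prior P(S); P(not S) = 1 - p_spam
    return p_f_given_spam, p_f_given_ham, p_spam
```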

Given a new example with a feature vector $\bm{f}$, you can use Bayes rule and the naive assumption of independence of feature probabilities to write $$P(S|\bm{f})=\frac{P(\bm{f}|S)P(S)}{P(\bm{f})} = \frac{P(\bm{f}_1|S)P(\bm{f}_2|S)\ldots P(\bm{f}_n|S)\;P(S)}{P(\bm{f})}=\frac{\left(\prod P(\bm{f}_i|S)\right)\;P(S)}{P(\bm{f})}$$ Similarly $P(\neg S|\bm{f}) = (\prod P(\bm{f}_i|\neg S))P(\neg S)\;/P(\bm{f})$. So you can get the probabilities $P(S|\bm{f})$ and $P(\neg S|\bm{f})$ based on the probabilities estimated earlier. You decide whether the new example is spam or not spam based on which probability is higher, which means that you can drop the denominator $P(\bm{f})$.
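Continuing the same hypothetical sketch, the decision rule for a new binary feature vector `f` is then a product of the relevant conditionals times the class prior, with $P(\bm{f})$ dropped:

```python
import numpy as np

def classify(f, p_f_given_spam, p_f_given_ham, p_spam):
    """Return True if the feature vector f is judged spam."""
    # Use P(f_i = 1 | class) where f_i is on, and 1 - P(f_i = 1 | class) where it is off
    spam_score = np.prod(np.where(f == 1, p_f_given_spam, 1 - p_f_given_spam)) * p_spam
    ham_score = np.prod(np.where(f == 1, p_f_given_ham, 1 - p_f_given_ham)) * (1 - p_spam)
    return spam_score >= ham_score    # P(f) cancels, so comparing numerators is enough
```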

This is easy to implement in RapidMiner, from the looks of a quick web search. It's also very easy to implement from scratch: you're talking low tens of lines of code in a language like R or Matlab. If you do so, it's worth noting that when an estimated conditional probability is zero, you need to pick some small non-zero value instead so that the product of conditional probabilities doesn't collapse to zero. (It's also worth replacing the products with sums of logarithms to avoid numerical underflow.)
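As a sketch of those two fixes (Laplace-style smoothing so no estimated probability is exactly zero or one, and sums of logs instead of products; again with hypothetical names, continuing the earlier snippets):

```python
import numpy as np

def estimate_params_smoothed(X, y, eps=1.0):
    """Like estimate_params, but with eps pseudo-counts so no probability is 0 or 1."""
    spam, ham = X[y == 1], X[y == 0]
    p_f_given_spam = (spam.sum(axis=0) + eps) / (spam.shape[0] + 2 * eps)
    p_f_given_ham = (ham.sum(axis=0) + eps) / (ham.shape[0] + 2 * eps)
    p_spam = spam.shape[0] / X.shape[0]
    return p_f_given_spam, p_f_given_ham, p_spam

def classify_log(f, p_f_given_spam, p_f_given_ham, p_spam):
    """Same decision rule as before, but summing logs to avoid numerical underflow."""
    log_spam = np.sum(np.log(np.where(f == 1, p_f_given_spam, 1 - p_f_given_spam))) + np.log(p_spam)
    log_ham = np.sum(np.log(np.where(f == 1, p_f_given_ham, 1 - p_f_given_ham))) + np.log(1 - p_spam)
    return log_spam >= log_ham
```

If a ready-made implementation is acceptable, scikit-learn's `BernoulliNB` does essentially this (binary features, smoothing, log-space computation) out of the box.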

Naive Bayes is a very simple technique, but it has some drawbacks. One is how to deal with conditional probabilities that are zero, as discussed above. Perhaps more seriously, Naive Bayes as described here pays no attention to word order or context (words occurring together can have quite a different meaning than they do occurring singly). A good way to find information on more sophisticated techniques is to read up on sentiment analysis.
