Solved – Random forest – short text classification

Tags: classification, machine learning, random forest

I have read this article, which compares various machine learning algorithms on SMS spam detection. I have been able to reproduce the results for Naive Bayes, and now I would like to try Random Forest as well.

With Naive Bayes, it's straightforward: the bag-of-words model is used, and missing probabilities are handled with Laplace smoothing. Only the words present in an SMS are tested against the SPAM|HAM classes. However, I am not sure how to handle the input with a random forest.

Suppose I have the following setup (a code sketch of this pipeline follows the list):

  • 5000 distinct words in the training set, after stemming and removal of stop words
  • the text to classify is short, e.g. 10 words on average
  • CART is used as the tree model
  • the random forest selects a subset of features at each split, say 2*sqrt(5000) ≈ 141 words
  • word frequency is used as the feature value (could also be TF-IDF)
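
A minimal sketch of this setup, assuming scikit-learn; the toy corpus, variable names, and the capping of `max_features` are mine, not from the article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for a real SMS corpus; these are placeholders.
messages = ["free entry into a weekly prize draw", "are we still meeting today"]
labels = ["spam", "ham"]

# Bag-of-words with stop-word removal; a stemming step (e.g. via NLTK)
# would go in between but is omitted here for brevity.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(messages)  # word-frequency (TF) matrix

# max_features = number of candidate features considered at each split;
# 2 * sqrt(5000) ≈ 141 in the setup above, capped for this toy corpus.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features=min(141, X.shape[1]),
)
forest.fit(X, labels)
```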

So my questions are:

  1. Generally speaking, regardless of the article, can a random forest be used effectively for short text classification when the feature space is large? It seems to me that there may be a lot of weak classifiers due to the large feature space, with only a handful of features present in the data to classify.
  2. How should unseen words (not present in the training set) be handled with a random forest? Should they simply be stripped from the input, or should some technique similar to Laplace smoothing be used?
  3. If somebody has read this short article, could you explain how the author represented the features for the random forest?

Best Answer

Generally speaking, in a machine learning approach, it is recommended to test several models regardless of their theoretical performance, because their accuracy depends on the training dataset. True enough, a few algorithms are generally preferred for text classification (SVM, Naive Bayes, multinomial regression) for various reasons, such as linear separability or the curse of dimensionality (see this paper). However, nothing prevents you from trying a random forest and examining its performance.
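
For instance, here is a hedged sketch (not from the paper) of trying several candidate models on the same TF matrix and comparing cross-validated accuracy, assuming scikit-learn and a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

messages = [
    "win a free prize now", "urgent claim your reward",
    "cheap loans call today", "are we still on for lunch",
    "see you at the meeting", "happy birthday to you",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

X = CountVectorizer().fit_transform(messages)
for name, model in [("naive_bayes", MultinomialNB()),
                    ("linear_svm", LinearSVC()),
                    ("random_forest", RandomForestClassifier(n_estimators=50))]:
    scores = cross_val_score(model, X, labels, cv=3, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```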

Now to answer your questions.

Generally speaking, regardless of the article, can a random forest be used effectively for short text classification when the feature space is large? It seems to me that there may be a lot of weak classifiers due to the large feature space, with only a handful of features present in the data to classify.

Given the nature of random forests (a bagged ensemble of decision trees), it is true that you may end up with a rather weak classifier, especially if only a couple of features are truly significant in determining the outcome. A boosted ensemble of decision trees (such as AdaBoost or gradient boosting) may be preferable, since it gives more weight to the relevant predictors.
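
If you want to try that route, a minimal sketch with scikit-learn's GradientBoostingClassifier follows (AdaBoostClassifier would be the analogous choice for AdaBoost); the toy data is illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

messages = ["win cash now", "free prize waiting",
            "lunch at noon", "meeting moved to friday"]
labels = ["spam", "spam", "ham", "ham"]
X = CountVectorizer().fit_transform(messages).toarray()

# Boosting fits trees sequentially, each one focusing on the examples
# the previous trees got wrong, which upweights the informative features.
boosted = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)
boosted.fit(X, labels)
```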

However, keep in mind that in text classification a preprocessing phase is required to build the TF or TF-IDF matrix, and through it you have already made a selection of pertinent features. Potentially, all the features in this matrix are relevant, so the random forest may perform well when predicting the outcome.
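
As an illustration (the thresholds below are my assumptions, not the article's), scikit-learn's TfidfVectorizer lets you bake this feature selection into the preprocessing via min_df and max_df:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "free prize call now", "call now to claim your free prize",
    "lunch tomorrow", "lunch at the usual place tomorrow",
]
# min_df / max_df act as a crude feature-selection step: words that are
# too rare or too common never become columns of the matrix at all.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # only the retained words
```

(`get_feature_names_out` assumes scikit-learn ≥ 1.0; older versions use `get_feature_names`.)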

How should unseen words (not present in the training set) be handled with a random forest? Should they simply be stripped from the input, or should some technique similar to Laplace smoothing be used?

Since you are using word frequency as the feature, you have already defined the grid over which the outcome of a message is evaluated. Unseen words will be filtered out and stripped by your algorithm, because you only evaluate whether the words of the test set are present or absent in your TF matrix.
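
A quick demonstration of this behaviour, assuming scikit-learn's CountVectorizer: words outside the fitted vocabulary are silently dropped at transform time, so no smoothing analogue is needed on the random-forest side:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["free prize now", "lunch tomorrow"])

# "blockchain" was never seen during fitting; it simply contributes nothing.
row = vectorizer.transform(["free blockchain prize"]).toarray()
print(dict(zip(vectorizer.get_feature_names_out(), row[0].tolist())))
# -> {'free': 1, 'lunch': 0, 'now': 0, 'prize': 1, 'tomorrow': 0}
```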

If somebody has read this short article, could you explain how the author represented the features for the random forest?

The author uses the scikit-learn library from Python. I conclude that the dataset was most likely a numerical dataframe (categorical values are typically converted into dummy Boolean columns in Python). From my understanding, the author used the following features (a hedged reconstruction in code follows the list):

  • Filtered words, based on their frequency ($>5$ and $<500$) and, more specifically, whether they are absent/present in the message
  • Message length (number of characters in the message), later transformed into five categorical booleans (message length $\le 40$, $\le 60$, ...)
  • Occurrences of the character \$
  • Number of numeric strings
  • Presence/absence of non-alphabetic characters
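
For illustration, here is one way such a feature set could be assembled; the helper name, the length cut-offs beyond 40 and 60, and the example vocabulary are guesses for demonstration, not taken from the article:

```python
import re

def extract_features(message, vocabulary):
    """Feature dict for one SMS; `vocabulary` is assumed to be the set of
    words kept by the frequency filter (count > 5 and < 500)."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    features = {f"has_{w}": w in words for w in vocabulary}
    # Message length bucketed into categorical booleans; cut-offs beyond
    # 40 and 60 are guesses, the article only hints at five buckets.
    for limit in (40, 60, 80, 100, 120):
        features[f"len_le_{limit}"] = len(message) <= limit
    features["dollar_count"] = message.count("$")
    features["numeric_strings"] = len(re.findall(r"\d+", message))
    features["has_non_alpha"] = bool(re.search(r"[^a-zA-Z\s]", message))
    return features

print(extract_features("WIN $1000 now!!!", {"win", "now", "free"}))
```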