Solved – Determining the variables when training/classifying with a random forest

classification, random forest

Based on a response to the question "Statistical classification of text", I've decided that I want to use a random forest to perform various classifications of text.

Specifically, I am getting YouTube video data from their XML feeds. Here is an example:

<entry gd:etag="W/"DUQBR347eCp7I2A9WhdXFEQ."">
    <id>tag:youtube.com,2008:video:ltkMAcv2cas</id>
    <published>2010-12-23T18:11:28.000Z</published>
    <updated>2011-08-28T02:09:16.000Z</updated>
    <category scheme="http://schemas.google.com/g/2005#kind"
        term="http://gdata.youtube.com/schemas/2007#video"/>
    <category scheme="http://gdata.youtube.com/schemas/2007/keywords.cat" 
        term="SSF4AE"/>
    <!-- More categories/keywords here -->
    <title>SSF4AE Mago 2D God VS Kazunoko aka Inoue</title>
    <!-- More video information here, description, etc -->
</entry>

With that, I have very specific data that I believe I can use as the basis for variables in a random forest:

Variable    Value
--------    -----
Title       SSF4AE Mago 2D God VS Kazunoko aka Inoue
Category    SSF4AE

Looking at the video and based on the title/categories, I know that this is a video of the game Super Street Fighter 4: Arcade Edition (SSF4AE).

Looking at the alglib Decision Forest, I also read its general principles of data analysis methods, which discusses encoding specifically for the algorithms in alglib; while I won't use that library itself, I believe it improved my understanding of how data should be represented for a random forest.

Basically, using the above data, I'd need to transform each data point into a number of some sort.

For example, I could transform this into the following variables:

Variable                 Value
--------                 -----
Has "SSF4AE" in Title        1
Equals "SSF4AE"              1
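
To make that concrete, here is a minimal sketch in Python of how such an encoding might look (the function and feature names are my own, not from any library):

```python
def encode_video(title, categories):
    # Binary indicator variables derived from the raw title/category strings.
    # (Feature names here are illustrative only.)
    return {
        "has_SSF4AE_in_title": 1 if "SSF4AE" in title.upper() else 0,
        "category_equals_SSF4AE": 1 if any(c.upper() == "SSF4AE" for c in categories) else 0,
    }

features = encode_video("SSF4AE Mago 2D God VS Kazunoko aka Inoue", ["SSF4AE"])
# features == {"has_SSF4AE_in_title": 1, "category_equals_SSF4AE": 1}
```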

Now, assuming I know that this video is for SSF4AE, and that I know that SSF4AE is a common acronym used to identify the game, would using the above as variables work for a random forest for training and classification?

Of course, that's a small and very rough set of variables; my understanding is that a random forest creates nodes with a threshold for each variable: values over the threshold fall on one side of the split, values under it on the other.

In the above example, there are only two options, true or false.

However, I imagine I could instead compute the term frequency/inverse document frequency (e.g. using Lucene.NET or elasticsearch) and use that, like so (numbers are not accurate):

Variable                               Value
--------                               -----
Term Query for "SSF4AE" in Title        0.36
Term Query for "SSF4AE" in Category     1.00

And then train the random forest on that.
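To make the tf-idf idea concrete, here is a toy sketch in Python; real engines such as Lucene use smoothed and normalized variants, so the actual numbers will differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy tf-idf: term frequency times log(N / document frequency)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tfidf([
    "ssf4ae mago vs kazunoko",
    "3s kuroda vs nuki",
])
# "vs" occurs in every document, so its idf is log(2/2) = 0 and it gets no
# weight, while distinctive terms such as "ssf4ae" get a positive score.
```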

I assume that if the variables were set up as above, I could also apply boosts to them (in the tf-idf calculation) where I see they are more prominent, or where I want them to carry more weight.

And finally, I can add more variables, but they have to be very specific. For example, I know that there is a character, Yun, who appears in SSF4AE (but not in previous versions of Street Fighter 4) as well as in Street Fighter III. To that end, I can add a few more variables:

Variable                               Value
--------                               -----
Term Query for "SSF4AE" in Title        0.36
Term Query for "SSF4AE" in Category     1.00
One character is Yun                       1
Term Query for "3S" in Title            0.00
Term Query for "3SO" in Title           0.00

("3SO" and "3S" being common acronyms for Street Fighter III: Third Strike)

However, for another video, which I know is for 3S, the variables might look like this:

Variable                               Value
--------                               -----
Term Query for "SSF4AE" in Title        0.00
Term Query for "SSF4AE" in Category     0.00
One character is Yun                       1
Term Query for "3S" in Title            0.54
Term Query for "3SO" in Title           0.00

Where the training set would indicate that video is for 3S.

That said, are these feasible ways of going about determining the variables of a random forest?

Also, if I understand correctly, the random forest will indicate which variables contributed the most to the classification, so I can later discard the least useful ones (if I want) or add different ones to get better classifications from my training data.

Best Answer

As requested, I'll turn my comment to an answer, although I doubt it will provide you with a working solution.

The Random Forest™ algorithm is well described on Breiman's webpage; see the references therein. Basically, it relies on the idea of bagging (bootstrap aggregation), where we "average" the results of one (or more) classifier run over different subsamples of the original dataset (1). This is part of what is more generally called ensemble learning. The advantage of this approach is that it allows us to keep a "training" sample (approx. 2/3 of the data when sampling with replacement) to build the classifier, and a "validation" sample to evaluate its performance; the latter is also known as the out-of-bag (OOB) sample.

Classification and regression trees (CART, e.g. (2)) generally work well within this framework, but it is worth noting that in this particular case we do not want to prune the trees (pruning is normally done to avoid overfitting); on the contrary, we deliberately choose to work with "weak" classifiers. By constructing several models and combining their individual estimates (most of the time by a majority vote), we introduce some variety, or fluctuation, in the model outputs, while at the same time we circumvent the decision of any single weak classifier by averaging over a collection of such decisions. The ESLII textbook (3) has a full chapter devoted to such techniques.

However, in RF there is a second level of randomization, this time at the level of the variables: only a random subset of the original variables is considered at each split when growing a tree, typically $\sqrt{p}$ of them, where $p$ is the number of predictors. This means that RF can work with many more variables than traditional methods, which generally expect more observations than variables. In addition, this technique is less affected by collinearity (correlation between the predictors). Finally, subsampling the predictors has the consequence that the fitted values across trees are more independent.
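
For example, with $p = 10000$ term features (a plausible vocabulary size for the text data above, chosen here only for illustration), each split would only consider about 100 of them:

```python
import math

p = 10_000               # number of predictors, e.g. distinct terms
k = round(math.sqrt(p))  # variables tried at each split, sqrt(p) by convention
# k == 100
```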

As depicted in the next figure, let's consider a block of predictors $X$ ($n\times p$) and a response vector $Y$, of length $n$, which stands for the outcome of interest (it might be a binary or numerical variable).

[figure omitted: schematic of the predictor matrix $X$ and the response vector $Y$]

Here, we'll be considering a classification task ($Y$ is categorical). The three main steps are as follows:

  1. Take a random sample of the $n$ individuals; they will be used to build a single tree.
  2. Select $k<p$ variables and find the first split; iterate until the full tree is grown (do not prune); drop the OOB data down the tree and record the outcome assigned to each observation.

Do steps 1--2 $b$ times (say, $b=500$), and each time count the number of times each individual from the OOB sample is classified into one category or the other. Then assign each case a category by using a majority vote over the $b$ trees. You have your predicted outcomes.
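
The mechanics above can be sketched in Python. This toy version stands in for full CART trees with one-variable "stumps" (so $k=1$), purely to show the bootstrap / OOB / majority-vote machinery; it is a sketch, not a usable RF implementation:

```python
import random
from collections import Counter, defaultdict

def fit_stump(X, y, feature):
    """A deliberately weak one-variable 'tree': split at the mean of the
    chosen feature and take the majority class on each side."""
    thr = sum(x[feature] for x in X) / len(X)
    left = [yi for x, yi in zip(X, y) if x[feature] <= thr]
    right = [yi for x, yi in zip(X, y) if x[feature] > thr]
    def majority(ys):
        return Counter(ys or y).most_common(1)[0][0]
    return feature, thr, majority(left), majority(right)

def predict_stump(stump, x):
    feature, thr, left, right = stump
    return left if x[feature] <= thr else right

def random_forest_oob(X, y, b=500, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = defaultdict(Counter)              # OOB votes per observation
    for _ in range(b):
        # step 1: bootstrap sample; rows never drawn form the OOB sample
        in_bag = [rng.randrange(n) for _ in range(n)]
        oob = set(range(n)) - set(in_bag)
        # step 2: grow a weak tree on a random subset of variables (k = 1 here)
        stump = fit_stump([X[i] for i in in_bag],
                          [y[i] for i in in_bag],
                          rng.randrange(p))
        # drop the OOB data down the tree and record the assigned outcome
        for i in oob:
            votes[i][predict_stump(stump, X[i])] += 1
    # majority vote over the b trees = predicted outcome per observation
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}

# Made-up, well-separated feature vectors in the spirit of the question:
X = [[0.9, 0.9], [0.85, 0.85], [0.8, 0.8],    # SSF4AE-like
     [0.2, 0.2], [0.15, 0.15], [0.1, 0.1]]    # 3S-like
y = ["SSF4AE"] * 3 + ["3S"] * 3
preds = random_forest_oob(X, y, b=200, seed=1)
```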

In order to assess variable importance, that is, how much a variable contributes to classification accuracy, we can estimate the extent to which the predictions would be affected if there were no association between the outcome and the given variable. An obvious way to do this is to permute the observed values of that variable (this follows from the principle of re-randomization) and compute the difference in predictive accuracy relative to the original model. This is step 3. The more predictive a variable is (when trying to predict a future outcome--this really is forecasting, because remember that we only use the OOB sample to assess the predictive ability of the model), the more the classification accuracy will tend to decrease when the link between its observed values and the outcome is broken.
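
A toy sketch of that permutation idea in Python; the "model" and data here are invented for illustration, and a real implementation would shuffle within the OOB sample of each tree:

```python
import random

def accuracy(predict, X, y):
    return sum(predict(x) == yi for x, yi in zip(X, y)) / len(y)

def permutation_drop(predict, X, y, feature, seed=0):
    """Break the link between one variable and the outcome by shuffling that
    variable's column, then measure the drop in accuracy."""
    rng = random.Random(seed)
    col = [x[feature] for x in X]
    rng.shuffle(col)
    X_perm = [list(x) for x in X]
    for row, v in zip(X_perm, col):
        row[feature] = v
    return accuracy(predict, X, y) - accuracy(predict, X_perm, y)

# A stand-in "forest" that only ever looks at feature 0:
predict = lambda x: "SSF4AE" if x[0] > 0.5 else "3S"
X = [[0.9, 0.2], [0.8, 0.9], [0.1, 0.1], [0.2, 0.8]]
y = ["SSF4AE", "SSF4AE", "3S", "3S"]

# Average the drop over several shuffles to smooth out permutation noise.
drop0 = sum(permutation_drop(predict, X, y, 0, seed=s) for s in range(50)) / 50
drop1 = sum(permutation_drop(predict, X, y, 1, seed=s) for s in range(50)) / 50
# drop1 is exactly 0: shuffling a variable the model ignores changes nothing,
# while shuffling the informative feature 0 costs accuracy.
```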


Now, about your question,

  • Reliable ways to assess the importance of your predictors have been proposed in the past (such as the one described above, or additional resampling strategies related to backward elimination (4,5)); so there's no need to reinvent the wheel.
  • The hard task, IMO, is to choose your descriptors or features, that is, the characteristics of your records. Note that they should not be specific to a subset of your units, but should rather encompass a general description of possible events, e.g. "One Shot Game" vs. others (I'm not very good at games :-).
  • Do an extensive literature search on what has been carried out in this direction on this specific field.
  • Do not break the control on overfitting that is guaranteed by RF by running additional feature selection + new classification tasks on the same sample.

If you're interested in knowing more about RF, there's probably a lot of additional information on this site, and currently available implementations of RF have been listed here.

References

  1. Sutton, C.D. (2005). Classification and Regression Trees, Bagging, and Boosting, in Handbook of Statistics, Vol. 24, pp. 303-329, Elsevier.
  2. Moissen, G.G. (2008). Classification and Regression Trees. Ecological Informatics, pp. 582-588.
  3. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
  4. Díaz-Uriarte, R., Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using Random Forest. BMC Bioinformatics, 7:3.
  5. Díaz-Uriarte, R. (2007). GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using Random Forest. BMC Bioinformatics, 8:328.