Based on a response to the question "Statistical classification of text", I've decided that I want to use a random forest to perform various classifications of text.
Specifically, I am getting YouTube video data from their XML feeds. Here is an example:
<entry gd:etag="W/&quot;DUQBR347eCp7I2A9WhdXFEQ.&quot;">
<id>tag:youtube.com,2008:video:ltkMAcv2cas</id>
<published>2010-12-23T18:11:28.000Z</published>
<updated>2011-08-28T02:09:16.000Z</updated>
<category scheme="http://schemas.google.com/g/2005#kind"
term="http://gdata.youtube.com/schemas/2007#video"/>
<category scheme="http://gdata.youtube.com/schemas/2007/keywords.cat"
term="SSF4AE"/>
<!-- More categories/keywords here -->
<title>SSF4AE Mago 2D God VS Kazunoko aka Inoue</title>
<!-- More video information here, description, etc -->
</entry>
With that, I have very specific data that I believe I can use as the basis for variables in a random forest:
Variable   Value
--------   -----
Title      SSF4AE Mago 2D God VS Kazunoko aka Inoue
Category   SSF4AE
Looking at the video and based on the title/categories, I know that this is a video of the game Super Street Fighter 4: Arcade Edition (SSF4AE).
Looking at the alglib Decision Forest, I also read the general principles of data analysis methods, which speak about encoding specifically for the algorithms in alglib; while I won't use that library specifically, I believe it expanded my understanding of how data should be represented for a random forest.
Basically, using the above data, I'd need to transform each data point into a number of some sort.
For example, I could transform this into the following variables:
Variable                   Value
--------                   -----
Has "SSF4AE" in Title      1
Category equals "SSF4AE"   1
Now, assuming I know that this video is for SSF4AE, and that I know that SSF4AE is a common acronym used to identify the game, would using the above as variables work for a random forest for training and classification?
Of course, that's a small and very rough set of variables. My understanding is that a random forest will create nodes with a threshold for each variable: values over the threshold fall on one side of the tree, values under it on the other side.
In the above example, there are only two options, true or false.
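As a sketch of how such binary variables might be computed (the function and field names here are my own, not from any particular library):

```python
def featurize(title, category):
    """Turn a video's title/category into binary indicator variables
    suitable as inputs to a random forest."""
    return {
        "has_ssf4ae_in_title": 1 if "SSF4AE" in title.upper() else 0,
        "category_equals_ssf4ae": 1 if category.upper() == "SSF4AE" else 0,
    }

features = featurize("SSF4AE Mago 2D God VS Kazunoko aka Inoue", "SSF4AE")
# Both indicators are 1 for this video.
```

Each video then becomes one row of 0/1 values, with its known game as the training label.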
However, I imagine I can do something like compute the term frequency–inverse document frequency (tf-idf) and use that (via Lucene.NET or elasticsearch), like so (numbers are not accurate):
Variable                             Value
--------                             -----
Term Query for "SSF4AE" in Title     0.36
Term Query for "SSF4AE" in Category  1.00
And then train the random forest on that.
I assume that if the variables were computed as above, then I could also apply boosts to the variables (in the tf-idf calculation) where I see that they are more prominent or where I want them to carry more weight.
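A hand-rolled sketch of this idea (this is a simplified tf-idf, not the actual Lucene.NET/elasticsearch scoring formula, and the boost is just a scale factor I've added for illustration):

```python
import math

def tf_idf(term, doc_tokens, all_docs, boost=1.0):
    """Simplified tf-idf score for one term in one document.
    `boost` plays the role of a field/term boost: it scales the
    score, giving that variable more weight in the feature vector."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    n_containing = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / (1 + n_containing)) + 1
    return boost * tf * idf

docs = [
    ["ssf4ae", "mago", "vs", "kazunoko"],
    ["3s", "ranking", "battle"],
    ["tekken", "6", "finals"],
]
score = tf_idf("ssf4ae", docs[0], docs)
boosted = tf_idf("ssf4ae", docs[0], docs, boost=2.0)  # twice the weight
```

A real search engine would handle tokenization, normalization, and smoothing for you; the point is only that each variable ends up as a continuous score the forest can threshold on.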
And finally, I can add more variables, but they have to be very specific. For example, I know that the character Yun appears only in SSF4AE (not in previous versions of Street Fighter 4), as well as in Street Fighter III. To that end, I can add three more variables:
Variable                             Value
--------                             -----
Term Query for "SSF4AE" in Title     0.36
Term Query for "SSF4AE" in Category  1.00
One character is Yun                 1
Term Query for "3S" in Title         0.00
Term Query for "3SO" in Title        0.00
("3SO" and "3S" being common acronyms for Street Fighter III: Third Strike)
However, for another video, which I know is for 3S, the variables might look like this:
Variable                             Value
--------                             -----
Term Query for "SSF4AE" in Title     0.00
Term Query for "SSF4AE" in Category  0.00
One character is Yun                 1
Term Query for "3S" in Title         0.54
Term Query for "3SO" in Title        0.00
Where the training set would indicate that video is for 3S.
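To make this concrete, here's how the two example rows might be assembled into fixed-order feature vectors, which is the shape any random forest implementation expects (variable names are mine, and the numbers are the illustrative values from the tables above):

```python
# Fixed order of variables; every video's vector follows this layout.
VARIABLES = [
    "tfidf_ssf4ae_title",
    "tfidf_ssf4ae_category",
    "has_character_yun",
    "tfidf_3s_title",
    "tfidf_3so_title",
]

def to_vector(row):
    """Missing variables default to 0.0, e.g. a term that never matched."""
    return [row.get(name, 0.0) for name in VARIABLES]

ssf4ae_video = {"tfidf_ssf4ae_title": 0.36, "tfidf_ssf4ae_category": 1.00,
                "has_character_yun": 1}
threes_video = {"has_character_yun": 1, "tfidf_3s_title": 0.54}

X = [to_vector(ssf4ae_video), to_vector(threes_video)]
y = ["SSF4AE", "3S"]  # known training labels, one per row
```

`X` and `y` are then the training inputs for the forest; the shared Yun indicator is exactly the kind of variable the tree splits would have to disambiguate using the other columns.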
That said, are these feasible ways of going about determining the variables of a random forest?
Also, if I understand correctly, the random forest will indicate which variables contributed the most to the classification, so I can later discard the unimportant ones (if I want) or add different ones to get better classifications from my training data.
Best Answer
As requested, I'll turn my comment to an answer, although I doubt it will provide you with a working solution.
The Random Forest™ algorithm is well described on Breiman's webpage; see the references therein. Basically, it relies on the idea of bagging (bootstrap aggregation), where we "average" the results of one (or more) classifiers run over different subsamples of the original dataset (1). This is part of what is more generally called ensemble learning.
The advantage of this approach is that it allows us to keep a "training" sample (approx. 2/3 of the cases when sampling with replacement) to build the classifier, and a "validation" sample to evaluate its performance; the latter is also known as the out-of-bag (OOB) sample. Classification and regression trees (CART, e.g. (2)) generally work well within this framework, but it is worth noting that in this particular case we do not want to prune the trees; on the contrary, we deliberately choose to work with "weak" classifiers and let the aggregation guard against overfitting. By constructing several models and combining their individual estimates (most of the time by a majority vote), we introduce some variety, or fluctuation, in the model outputs, while at the same time we circumvent the decision of any single weak classifier by averaging over a collection of such decisions. The ESLII textbook (3) has a full chapter devoted to such techniques.
However, in RF there is a second level of randomization, this time at the level of the variables: only a random subset of the original variables is considered when growing a tree, typically $\sqrt{p}$ of them, where $p$ is the number of predictors. This means that RF can work with many more variables than traditional methods, which usually expect more observations than variables. In addition, this technique is less affected by collinearity (correlation between the predictors). Finally, subsampling the predictors has the consequence that the fitted values across trees are more independent.
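The variable-subsampling step can be sketched in a few lines (a simplification: Breiman's algorithm actually redraws the subset at each split, not once per tree, as noted in the comment):

```python
import math
import random

def sample_predictors(p, seed=None):
    """Pick the random subset of predictors considered when growing a
    tree (in Breiman's formulation, redrawn at each split): sqrt(p)
    of the p available variables, chosen without replacement."""
    rng = random.Random(seed)
    m = max(1, int(math.sqrt(p)))
    return sorted(rng.sample(range(p), m))

# With p = 16 predictors, each draw exposes only 4 of them to the tree.
subset = sample_predictors(16, seed=0)
```

Because different trees see different variables, no single strong predictor can dominate every tree, which is what decorrelates the ensemble.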
As depicted in the next figure, let's consider a block of predictors $X$ ($n\times p$) and a response vector $Y$, of length $n$, which stands for the outcome of interest (it might be a binary or numerical variable).
Here, we'll be considering a classification task ($Y$ is categorical). The three main steps are as follows:
1. Draw a bootstrap sample of the $n$ cases (sampling with replacement); the cases left out constitute the out-of-bag (OOB) sample.
2. Grow a classification tree on the bootstrap sample, considering only a random subset of the predictors at each split, and use it to classify the OOB cases.
3. Permute each predictor in turn to assess its importance (detailed below).
Do steps 1--2 $b$ times (say, $b=500$), and each time count the number of times each individual from the OOB sample is classified into one category or the other. Then assign each case a category by using a majority vote over the $b$ trees. You have your predicted outcomes.
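The bootstrap-and-vote loop might be sketched as below. To keep it self-contained I use a one-feature threshold stump as the "weak" classifier; a real implementation would grow a full CART tree with per-split variable subsampling:

```python
import random
from collections import Counter, defaultdict

def stump(X, y):
    """A deliberately weak classifier standing in for a CART tree:
    threshold on the first feature at its training mean, then predict
    the majority label on each side."""
    t = sum(row[0] for row in X) / len(X)
    above = [lab for row, lab in zip(X, y) if row[0] > t]
    below = [lab for row, lab in zip(X, y) if row[0] <= t]
    def majority(labels, fallback):
        return Counter(labels).most_common(1)[0][0] if labels else fallback
    overall = majority(y, None)
    return lambda row: (majority(above, overall) if row[0] > t
                        else majority(below, overall))

def bagged_predict(X, y, b=500, seed=0):
    """Steps 1-2 repeated b times, then a majority vote over OOB votes."""
    rng = random.Random(seed)
    n = len(X)
    votes = defaultdict(Counter)                     # OOB votes per case
    for _ in range(b):
        boot = [rng.randrange(n) for _ in range(n)]  # step 1: bootstrap
        model = stump([X[i] for i in boot], [y[i] for i in boot])  # step 2
        for i in set(range(n)) - set(boot):          # classify OOB cases
            votes[i][model(X[i])] += 1
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}

X = [[0.9], [0.8], [0.1], [0.2]]
y = ["A", "A", "B", "B"]
oob_preds = bagged_predict(X, y, b=501)
```

Even though any single stump is crude, the majority vote over hundreds of bootstrap rounds recovers the correct labels on this toy data.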
In order to assess variable importance, that is, how much a variable contributes to classification accuracy, we can estimate the extent to which the predictions would be affected if there were no association between the outcome and the given variable. An obvious way to do this is to shuffle the observed values of that variable (this follows from the principle of re-randomization) and compute the difference in predictive accuracy with respect to the original model. This is step 3. The more predictive a variable is (when trying to predict a future outcome--this really is forecasting, because remember that we only use the OOB sample to assess the predictive ability of the model), the more the classification accuracy will tend to decrease when we break the link between its observed values and the outcome.
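Step 3 can be sketched as follows; the toy `model` is a hypothetical classifier of my own (in a real forest you would measure the accuracy drop on the OOB sample of each tree):

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == lab for row, lab in zip(X, y)) / len(y)

def permutation_importance(model, X, y, col, rounds=100, seed=0):
    """Break the link between predictor `col` and the outcome by
    shuffling that column, and report the mean drop in accuracy."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for _ in range(rounds):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        Xp = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        drops.append(base - accuracy(model, Xp, y))
    return sum(drops) / rounds

# Toy model: predicts "A" when feature 0 exceeds 0.5, ignoring feature 1.
model = lambda row: "A" if row[0] > 0.5 else "B"
X = [[0.9, 5.0], [0.8, 1.0], [0.1, 5.0], [0.2, 1.0]]
y = ["A", "A", "B", "B"]
imp0 = permutation_importance(model, X, y, col=0)  # informative feature
imp1 = permutation_importance(model, X, y, col=1)  # ignored feature
```

Shuffling the informative feature degrades accuracy, while shuffling the ignored one changes nothing, which is exactly the signal used to rank variables.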
Now, about your question,
If you're interested in knowing more about RF, there's probably a lot of additional information on this site, and currently available implementations of RF have been listed here.
References