(I'm far from an expert. These are just musings from a junior statistician who has dealt with different, but loosely analogous, issues; my answer might be out of context.)
Given a new sample to be predicted, and an oracle with access to a much larger training set, perhaps the "best" and most honest prediction is to say "I predict with 60% probability that this belongs in the Red class rather than the Blue class".
I'll give a more concrete example. Imagine that, in our very large training set, there is a large set of samples that are very similar to our new sample. Of these, 60% are red and 40% are blue, and there appears to be nothing that distinguishes the Reds from the Blues. In such a case, a 60%/40% prediction is the only one a sane person can make.
Of course, we don't have such an oracle; instead we have lots of trees. Simple decision trees are incapable of making these 60%/40% predictions, so each tree makes a discrete prediction (Red or Blue, nothing in between). Because this new sample falls just on the Red side of the decision surface, you will find that almost all of the trees predict Red rather than Blue. Each tree pretends to be more certain than it really is, and that starts a stampede towards a biased prediction.
The problem is that we tend to misinterpret the decision from a single tree. When a single tree declares a terminal node to be in the Red class, we should not interpret that as a 100%/0% prediction from the tree. (I'm not just saying that we 'know' it's probably a bad prediction. I'm saying something stronger, i.e. that we should be careful about what we interpret as being the tree's prediction.) I can't concisely expand on how to fix this, but it is possible to borrow ideas from areas of statistics about how to construct more 'fuzzy' splits within a tree, encouraging a single tree to be more honest about its uncertainty. Then it should be possible to meaningfully average the predictions from a forest of trees.
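As an aside, a readily available half-measure (not the fuzzy splits I allude to, just a related illustration) is to read off the class proportions in a tree's terminal node instead of the hard label. A minimal sketch with rpart, using the built-in iris data:

library(rpart)

## A single classification tree on iris. predict() can return the class
## proportions of the terminal node a sample lands in, rather than only
## the winning class, i.e. the node's own 60%/40%-style uncertainty.
fit <- rpart(Species ~ ., data = iris)
head(predict(fit, type = "prob"))   # per-leaf class proportions
head(predict(fit, type = "class"))  # the hard label those proportions collapse to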
I hope this helps a little. If not, I hope to learn from any responses.
Each tree in the forest is built from a bootstrap sample of the observations in your training data. Those observations in the bootstrap sample build the tree, whilst those not in the bootstrap sample form the out-of-bag (or OOB) samples.
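A minimal sketch of that split for a single tree (toy indices, not the randomForest internals):

set.seed(42)
n <- 10
boot <- sample(n, n, replace = TRUE)   # observation indices that grow this tree
oob  <- setdiff(seq_len(n), boot)      # everything not drawn is out-of-bag
boot
oob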
It should be clear that the same variables are available for the cases used to build a tree as for the cases in the OOB sample. To get predictions for the OOB sample, each case is passed down the current tree and the tree's rules are followed until it arrives in a terminal node. That yields the OOB predictions for that particular tree.
This process is repeated a large number of times, each tree trained on a new bootstrap sample from the training data and predictions for the new OOB samples derived.
As the number of trees grows, any one sample will be out-of-bag more than once, so the "average" of the predictions over the trees for which that sample was OOB is used as its OOB prediction. By "average" we mean the arithmetic mean of the predictions for a continuous response; for a categorical response, the majority vote may be used instead (the majority vote being the class with the most votes over the set of trees 1, ..., N).
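As a hedged sketch of the majority vote for one sample (the labels here are made up):

votes <- c("Red", "Red", "Blue", NA, "Red", "Blue")  # one sample's OOB votes; NA = in-bag
names(which.max(table(votes)))                       # majority vote -> "Red"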
For example, assume we had the following OOB predictions for 10 samples in the training set over 10 trees:
set.seed(123)
## fake OOB predictions: 10 samples (rows) by 10 trees (columns)
oob.p <- matrix(rpois(100, lambda = 4), ncol = 10)
colnames(oob.p) <- paste0("tree", seq_len(ncol(oob.p)))
rownames(oob.p) <- paste0("samp", seq_len(nrow(oob.p)))
## blank out half the entries: NA marks trees for which a sample was in-bag
oob.p[sample(length(oob.p), 50)] <- NA
oob.p
> oob.p
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA     7     8     2     1    NA     5     3      2
samp2      6    NA     5     7     3    NA    NA    NA    NA     NA
samp3      3    NA     5    NA    NA    NA     3     5    NA     NA
samp4      6    NA    10     6    NA    NA     3    NA     6     NA
samp5     NA     2    NA    NA     2    NA     6     4    NA     NA
samp6     NA     7    NA     4    NA     2     4     2    NA     NA
samp7     NA    NA    NA     5    NA    NA    NA     3     9      5
samp8      7     1     4    NA    NA     5     6    NA     7     NA
samp9      4    NA    NA     3    NA     7     6     3    NA     NA
samp10     4     8     2     2    NA    NA     4    NA    NA      4
Here NA means the sample was in the bootstrap sample used to build that tree (in other words, it was not in the OOB sample for that tree). The mean of the non-NA values in each row gives the OOB prediction for each sample over the entire forest:
> rowMeans(oob.p, na.rm = TRUE)
 samp1  samp2  samp3  samp4  samp5  samp6  samp7  samp8  samp9 samp10
  4.00   5.25   4.00   6.20   3.50   3.80   5.50   5.00   4.60   4.00
As each tree is added to the forest, we can compute the OOB error up to and including that tree. For example, below are the cumulative means for each sample:
FUN <- function(x) {
    na <- is.na(x)
    ## running mean of the non-NA predictions seen so far
    cs <- cumsum(x[!na]) / seq_len(sum(!na))
    x[!na] <- cs   # put the running means back in the OOB positions
    x
}
t(apply(oob.p, 1, FUN))
> print(t(apply(oob.p, 1, FUN)), digits = 3)
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA  7.00  7.50  5.67  4.50    NA   4.6  4.33    4.0
samp2      6    NA  5.50  6.00  5.25    NA    NA    NA    NA     NA
samp3      3    NA  4.00    NA    NA    NA  3.67   4.0    NA     NA
samp4      6    NA  8.00  7.33    NA    NA  6.25    NA  6.20     NA
samp5     NA     2    NA    NA  2.00    NA  3.33   3.5    NA     NA
samp6     NA     7    NA  5.50    NA  4.33  4.25   3.8    NA     NA
samp7     NA    NA    NA  5.00    NA    NA    NA   4.0  5.67    5.5
samp8      7     4  4.00    NA    NA  4.25  4.60    NA  5.00     NA
samp9      4    NA    NA  3.50    NA  4.67  5.00   4.6    NA     NA
samp10     4     6  4.67  4.00    NA    NA  4.00    NA    NA    4.0
In this way we can see how the prediction accumulates over the N trees in the forest up to a given iteration. Reading across a row, the right-most non-NA value is the OOB prediction I showed above. This is how traces of OOB performance over the forest can be made: an RMSEP for the OOB samples can be computed from these cumulative OOB predictions at each number of trees.
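A hedged sketch of such a trace, assuming some made-up observed responses y for the ten samples (again, not randomForest internals; each sample's latest OOB prediction is carried forward so every tree count has a prediction):

y <- rpois(10, lambda = 4)       # made-up observed responses, for illustration only
cum.p <- t(apply(oob.p, 1, FUN)) # the running OOB means from above

## carry each sample's latest OOB prediction forward over in-bag trees
fill <- function(x) {
    for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
    x
}
cum.p <- t(apply(cum.p, 1, fill))

## root mean squared error of prediction after 1, 2, ..., N trees
rmsep <- apply(cum.p, 2, function(p) sqrt(mean((y - p)^2, na.rm = TRUE)))
plot(rmsep, type = "b", xlab = "Number of trees", ylab = "OOB RMSEP")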
Note that the R code shown is not taken from the internals of the randomForest package for R; I just knocked up some simple code so that you can follow what is going on once the predictions from each tree are determined.
Because each tree is built from a bootstrap sample, and a random forest contains a large number of trees, every training-set observation ends up in the OOB sample for one or more trees; that is why OOB predictions can be provided for all samples in the training data. (On average, an observation is out-of-bag for about a third of the trees, since the chance of not being drawn in a bootstrap sample of size n is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368.)
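A quick sanity check of that roughly-a-third figure:

set.seed(1)
n <- 1000
## fraction of observations left out of each of 200 simulated bootstrap samples
oob.frac <- replicate(200, length(setdiff(seq_len(n), sample(n, n, replace = TRUE))) / n)
mean(oob.frac)   # ~0.368, i.e. about e^-1 of the data is OOB for any one tree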
I have glossed over issues such as missing data for some OOB cases, etc., but these issues also pertain to a single regression or classification tree. Also note that each tree in a forest considers only mtry randomly selected candidate variables at each split.
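If it is useful, here is a hedged look at the default randomForest uses for that tuning parameter (the iris example is mine, not from the original discussion):

library(randomForest)

## default mtry: floor(sqrt(p)) for classification, floor(p/3) for regression,
## where p is the number of predictors
rf <- randomForest(Species ~ ., data = iris)
rf$mtry   # 2, i.e. floor(sqrt(4)) candidate variables tried at each split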