(I'm far from an expert. These are just musings from a junior statistician who has dealt with different, but loosely analogous, issues; my answer might be out of context.)
Given a new sample to be predicted, and an oracle with access to a much larger training set, perhaps the "best" and most honest prediction is to say "I predict with 60% probability that this belongs in the Red class rather than the Blue class".
I'll give a more concrete example. Imagine that, in our very large training set, there is a large set of samples that are very similar to our new sample. Of these, 60% are red and 40% are blue, and there appears to be nothing that distinguishes the Reds from the Blues. In such a case, a 60%/40% prediction is the only one a sane person can make.
Of course, we don't have such an oracle; instead we have lots of trees. Simple decision trees are incapable of making these 60%/40% predictions, so each tree makes a discrete prediction (Red or Blue, nothing in between). Because this new sample falls just on the Red side of the decision surface, you will find that almost all of the trees predict Red rather than Blue. Each tree pretends to be more certain than it really is, and that starts a stampede towards a biased prediction.
The problem is that we tend to misinterpret the decision from a single tree. When a single tree declares a terminal node to be in the Red class, we should not interpret that as a 100%/0% prediction from the tree. (I'm not just saying that we 'know' it's probably a bad prediction. I'm saying something stronger, i.e. that we should be careful about what we interpret as being the tree's prediction.) I can't concisely expand on how to fix this, but it is possible to borrow ideas from areas of statistics about how to construct more 'fuzzy' splits within a tree, encouraging a single tree to be more honest about its uncertainty. Then it should be possible to meaningfully average the predictions from a forest of trees.
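As an aside, a readily available half-measure (not the fuzzy splits I allude to, just a related illustration) is to read off the class proportions in a tree's terminal node instead of the hard label. A minimal sketch with rpart, using the built-in iris data:

library(rpart)

## A single classification tree on iris. predict() can return the class
## proportions of the terminal node a sample lands in, rather than only
## the winning class, i.e. the node's own 60%/40%-style uncertainty.
fit <- rpart(Species ~ ., data = iris)
head(predict(fit, type = "prob"))   # per-leaf class proportions
head(predict(fit, type = "class"))  # the hard label those proportions collapse to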
I hope this helps a little. If not, I hope to learn from any responses.
Each tree in the forest is built from a bootstrap sample of the observations in your training data. Those observations in the bootstrap sample build the tree, whilst those not in the bootstrap sample form the out-of-bag (or OOB) samples.
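A minimal sketch of that split for a single tree (toy indices, not the randomForest internals):

set.seed(42)
n <- 10
boot <- sample(n, n, replace = TRUE)   # observation indices that grow this tree
oob  <- setdiff(seq_len(n), boot)      # everything not drawn is out-of-bag
boot
oob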
It should be clear that the same variables are available for the cases used to build a tree as for the cases in the OOB sample. To get predictions for the OOB sample, each case is passed down the current tree and the tree's rules are followed until it arrives in a terminal node. That yields the OOB predictions for that particular tree.
This process is repeated a large number of times, each tree trained on a new bootstrap sample from the training data and predictions for the new OOB samples derived.
As the number of trees grows, any one sample will be out-of-bag more than once, so the "average" of the predictions over the trees for which that sample was OOB is used as its OOB prediction. By "average" we mean the arithmetic mean of the predictions for a continuous response; for a categorical response, the majority vote may be used instead (the majority vote being the class with the most votes over the set of trees 1, ..., N).
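As a hedged sketch of the majority vote for one sample (the labels here are made up):

votes <- c("Red", "Red", "Blue", NA, "Red", "Blue")  # one sample's OOB votes; NA = in-bag
names(which.max(table(votes)))                       # majority vote -> "Red"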
For example, assume we had the following OOB predictions for 10 samples in the training set over 10 trees:
set.seed(123)
## fake OOB predictions: 10 samples (rows) by 10 trees (columns)
oob.p <- matrix(rpois(100, lambda = 4), ncol = 10)
colnames(oob.p) <- paste0("tree", seq_len(ncol(oob.p)))
rownames(oob.p) <- paste0("samp", seq_len(nrow(oob.p)))
## blank out half the entries: NA marks trees for which a sample was in-bag
oob.p[sample(length(oob.p), 50)] <- NA
oob.p
> oob.p
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA     7     8     2     1    NA     5     3      2
samp2      6    NA     5     7     3    NA    NA    NA    NA     NA
samp3      3    NA     5    NA    NA    NA     3     5    NA     NA
samp4      6    NA    10     6    NA    NA     3    NA     6     NA
samp5     NA     2    NA    NA     2    NA     6     4    NA     NA
samp6     NA     7    NA     4    NA     2     4     2    NA     NA
samp7     NA    NA    NA     5    NA    NA    NA     3     9      5
samp8      7     1     4    NA    NA     5     6    NA     7     NA
samp9      4    NA    NA     3    NA     7     6     3    NA     NA
samp10     4     8     2     2    NA    NA     4    NA    NA      4
Here NA means the sample was in the bootstrap sample used to build that tree (in other words, it was not in the OOB sample for that tree). The mean of the non-NA values in each row gives the OOB prediction for each sample over the entire forest:
> rowMeans(oob.p, na.rm = TRUE)
 samp1  samp2  samp3  samp4  samp5  samp6  samp7  samp8  samp9 samp10
  4.00   5.25   4.00   6.20   3.50   3.80   5.50   5.00   4.60   4.00
As each tree is added to the forest, we can compute the OOB error up to and including that tree. For example, below are the cumulative means for each sample:
FUN <- function(x) {
    na <- is.na(x)
    ## running mean of the non-NA predictions seen so far
    cs <- cumsum(x[!na]) / seq_len(sum(!na))
    x[!na] <- cs   # put the running means back in the OOB positions
    x
}
t(apply(oob.p, 1, FUN))
> print(t(apply(oob.p, 1, FUN)), digits = 3)
       tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10
samp1     NA    NA  7.00  7.50  5.67  4.50    NA   4.6  4.33    4.0
samp2      6    NA  5.50  6.00  5.25    NA    NA    NA    NA     NA
samp3      3    NA  4.00    NA    NA    NA  3.67   4.0    NA     NA
samp4      6    NA  8.00  7.33    NA    NA  6.25    NA  6.20     NA
samp5     NA     2    NA    NA  2.00    NA  3.33   3.5    NA     NA
samp6     NA     7    NA  5.50    NA  4.33  4.25   3.8    NA     NA
samp7     NA    NA    NA  5.00    NA    NA    NA   4.0  5.67    5.5
samp8      7     4  4.00    NA    NA  4.25  4.60    NA  5.00     NA
samp9      4    NA    NA  3.50    NA  4.67  5.00   4.6    NA     NA
samp10     4     6  4.67  4.00    NA    NA  4.00    NA    NA    4.0
In this way we can see how the prediction accumulates over the N trees in the forest up to a given iteration. Reading across a row, the right-most non-NA value is the OOB prediction I showed above. This is how traces of OOB performance over the forest can be made: an RMSEP for the OOB samples can be computed from these cumulative OOB predictions at each number of trees.
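A hedged sketch of such a trace, assuming some made-up observed responses y for the ten samples (again, not randomForest internals; each sample's latest OOB prediction is carried forward so every tree count has a prediction):

y <- rpois(10, lambda = 4)       # made-up observed responses, for illustration only
cum.p <- t(apply(oob.p, 1, FUN)) # the running OOB means from above

## carry each sample's latest OOB prediction forward over in-bag trees
fill <- function(x) {
    for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
    x
}
cum.p <- t(apply(cum.p, 1, fill))

## root mean squared error of prediction after 1, 2, ..., N trees
rmsep <- apply(cum.p, 2, function(p) sqrt(mean((y - p)^2, na.rm = TRUE)))
plot(rmsep, type = "b", xlab = "Number of trees", ylab = "OOB RMSEP")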
Note that the R code shown is not taken from the internals of the randomForest package for R; I just knocked up some simple code so that you can follow what is going on once the predictions from each tree are determined.
Because each tree is built from a bootstrap sample, and a random forest contains a large number of trees, every training-set observation ends up in the OOB sample for one or more trees; that is why OOB predictions can be provided for all samples in the training data. (On average, an observation is out-of-bag for about a third of the trees, since the chance of not being drawn in a bootstrap sample of size n is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368.)
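A quick sanity check of that roughly-a-third figure:

set.seed(1)
n <- 1000
## fraction of observations left out of each of 200 simulated bootstrap samples
oob.frac <- replicate(200, length(setdiff(seq_len(n), sample(n, n, replace = TRUE))) / n)
mean(oob.frac)   # ~0.368, i.e. about e^-1 of the data is OOB for any one tree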
I have glossed over issues such as missing data for some OOB cases, etc., but these issues also pertain to a single regression or classification tree. Also note that each tree in a forest considers only mtry randomly selected candidate variables at each split.
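If it is useful, here is a hedged look at the default randomForest uses for that tuning parameter (the iris example is mine, not from the original discussion):

library(randomForest)

## default mtry: floor(sqrt(p)) for classification, floor(p/3) for regression,
## where p is the number of predictors
rf <- randomForest(Species ~ ., data = iris)
rf$mtry   # 2, i.e. floor(sqrt(4)) candidate variables tried at each split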