Random Forest – Ideas for Outputting Prediction Equations

Tags: prediction, random-forest

I've read through the following posts that answered the question I was going to ask:

Use Random Forest model to make predictions from sensor data

Decision tree for output prediction

Here's what I've done so far: I compared logistic regression to random forests, and the random forest outperformed logistic regression. Now the medical researchers I work with want to turn my RF results into a medical diagnostic tool. For example:

If you are an Asian Male between 25 and 35, have Vitamin D below xx and Blood Pressure above xx, you have a 76% chance of developing disease xxx.

However, RF doesn't lend itself to simple mathematical equations (see the links above). So here's my question: what ideas do you all have for using RF to develop a diagnostic tool (without having to export hundreds of trees)?

Here are a few of my ideas:

  1. Use RF for variable selection, then fit a logistic regression (with all possible interactions) to make the diagnostic equation.
  2. Somehow aggregate the forest into one "mega-tree" that averages the node splits across trees.
  3. Similar to #1 and #2, use RF to select variables (say m variables total), then build hundreds of classification trees, each of which uses all m variables, and pick the best single tree.

Any other ideas? Also, doing #1 is easy, but any ideas on how to implement #2 and #3?
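For what it's worth, here is a rough sketch of what #3 could look like with scikit-learn. The synthetic dataset, the value of m, the tree depth, and the number of candidate trees below are placeholders for illustration, not anything from the original question:

    # Illustrative sketch of idea #3 (placeholder data and parameters):
    # use RF importances to pick m variables, then grow many single trees and keep the best.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the real medical dataset
    X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=0)

    # Step 1: variable selection with a random forest
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_fit, y_fit)
    m = 8  # number of variables to keep -- a tuning choice
    top = np.argsort(rf.feature_importances_)[::-1][:m]

    # Step 2: grow many candidate trees on bootstrap samples of the m selected
    # variables and keep the one that does best on a validation split
    best_tree, best_score = None, -np.inf
    rng = np.random.RandomState(0)
    for seed in range(200):
        idx = rng.randint(0, len(y_fit), len(y_fit))  # bootstrap sample
        tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
        tree.fit(X_fit[idx][:, top], y_fit[idx])
        score = tree.score(X_val[:, top], y_val)
        if score > best_score:
            best_tree, best_score = tree, score

    print("validation accuracy of best single tree:", round(best_score, 3))
    print("held-out accuracy:", round(best_tree.score(X_test[:, top], y_test), 3))
    # The winning tree can be printed as explicit if/then rules for the clinicians
    print(export_text(best_tree, feature_names=[f"x{i}" for i in top]))

The printed rules come out as nested if/then statements on the m selected variables, which is close to the "diagnostic tool" format the researchers are asking for.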

Best Answer

Here are some thoughts:

  1. Any black-box model can be inspected to some extent. For example, you can compute the variable importance of each feature, or plot the predicted and the actual response against each feature (link; a quick sketch is given at the end of this answer);
  2. You might think about pruning the ensemble: not all the trees in the forest are necessary, and you may be able to use just a few. Paper: [Search for the Smallest Random Forest, Zhang]. Otherwise just Google "ensemble pruning" and have a look at Chapter 6 of "Ensemble Methods: Foundations and Algorithms" (a crude sketch of the idea is given at the end of this answer);
  3. You can build a single model by feature selection, as you said. Otherwise you can also try Domingos' method in [Knowledge acquisition from examples via multiple models], which consists of building a new dataset labeled with the black-box model's predictions and fitting a decision tree on top of it (sketched at the end of this answer).
  4. As mentioned in this Stack Exchange answer, a single tree model might seem interpretable, but its structure can change drastically under small perturbations of the training data, so it is often better to keep the black-box model. The final aim of an end user is to understand why a new record is classified as a particular class, so you might compute feature importances just for that particular record (a rough sketch is given at the end of this answer).

I would go for 1. or 2.
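For concreteness, here is a minimal sketch of point 1 using scikit-learn's permutation_importance and PartialDependenceDisplay; the synthetic data and parameter values are placeholders:

    # Sketch of point 1: global variable importance plus partial dependence plots
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance, PartialDependenceDisplay
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

    # Importance of each feature, measured on held-out data
    imp = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
    ranked = imp.importances_mean.argsort()[::-1]
    for i in ranked:
        print(f"feature {i}: {imp.importances_mean[i]:.3f}")

    # How the predicted probability moves with the two strongest features
    PartialDependenceDisplay.from_estimator(rf, X_test, features=[int(i) for i in ranked[:2]])
    plt.show()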
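Point 2 proper (as in Zhang's paper) searches for the best subset of trees; the sketch below is a much cruder version that simply truncates the forest after k trees and checks where held-out accuracy levels off. All data and numbers are placeholders:

    # Crude sketch related to point 2: how many trees do we actually need?
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

    # Class probabilities from every individual tree in the fitted forest
    all_probs = np.array([t.predict_proba(X_test) for t in rf.estimators_])

    # Average over the first k trees only and see where accuracy stops improving
    for k in [5, 10, 25, 50, 100, 250, 500]:
        pred = all_probs[:k].mean(axis=0).argmax(axis=1)
        print(k, "trees -> accuracy", round((pred == y_test).mean(), 3))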
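Point 3 (Domingos' idea) can be prototyped in a few lines: relabel the inputs with the forest's own predictions and fit a small surrogate tree to mimic it. Domingos' paper additionally generates new examples to label, which is skipped here for brevity; the setup below is a placeholder:

    # Sketch of point 3: a surrogate decision tree trained on black-box predictions
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

    # New dataset: same inputs, but labeled with the black-box predictions
    y_bb = rf.predict(X_train)
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_bb)

    agree = (surrogate.predict(X_test) == rf.predict(X_test)).mean()
    print("fidelity to the forest:", round(agree, 3))
    print("accuracy on the true labels:", round(surrogate.score(X_test, y_test), 3))
    print(export_text(surrogate, feature_names=[f"x{i}" for i in range(X.shape[1])]))

The printed surrogate tree is the part you would hand to the researchers; the fidelity number tells you how faithfully it mimics the forest.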
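For point 4, packages such as shap or treeinterpreter compute per-record contributions properly. Purely as an illustration of the idea, here is a crude hand-rolled version that perturbs one feature of a single record at a time and watches how the predicted risk shifts (all data and numbers are placeholders):

    # Crude sketch of point 4: a local, per-record importance measure
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

    record = X_test[0:1]                    # the single record (patient) to explain
    base = rf.predict_proba(record)[0, 1]   # its predicted probability of class 1

    rng = np.random.RandomState(0)
    for j in range(X_train.shape[1]):
        perturbed = np.repeat(record, 100, axis=0)
        perturbed[:, j] = rng.choice(X_train[:, j], size=100)  # scramble feature j only
        shift = abs(rf.predict_proba(perturbed)[:, 1].mean() - base)
        print(f"feature {j}: average shift in predicted risk = {shift:.3f}")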