Solved – Would a Random Forest with multiple outputs be possible/practical

Tags: cart, machine learning, monte carlo, multilabel, random forest

  1. Random Forests (RFs) are a competitive data modeling/mining method.

  2. A standard RF model has a single output: the prediction variable.

  3. The naive approach to modeling multiple outputs with RFs would be
    to construct a separate RF for each output variable, giving N
    independent models. Wherever the output variables are correlated,
    those models will duplicate structure, which is wasteful. As a
    general rule, more model parameters also means a more overfit model
    (less generalisation); I am not sure that applies here, but it
    probably does. A sketch of this naive approach follows the list.
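To illustrate the naive approach, here is a minimal sketch (assuming scikit-learn's `RandomForestRegressor` and a synthetic dataset; the variable names are mine) that fits one independent forest per output column:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 500 samples, 10 input features
# Two correlated outputs: the second is a noisy copy of the first.
y1 = X[:, 0] + X[:, 1]
y2 = y1 + 0.1 * rng.normal(size=500)
Y = np.column_stack([y1, y2])

# Naive approach: one independent forest per output column.
forests = [RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y[:, j])
           for j in range(Y.shape[1])]

# Predict by stacking each forest's output back into an n-tuple.
preds = np.column_stack([f.predict(X) for f in forests])
print(preds.shape)  # (500, 2)
```

Note that the two forests learn the shared structure of y1 and y2 twice, which is exactly the redundancy described above.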

In principle we could have an RF with multiple outputs, where the
prediction variable is a vector (n-tuple). The decision nodes in each
decision tree would then split the set of target/prediction vectors
based on a threshold vector. I figure this threshold can be taken as a
hyperplane in the n-dimensional output space, so that we can determine
which side of the threshold each target vector falls on.
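For concreteness, that side test can be written as a dot-product sign (my notation, not from the post): a hyperplane with unit normal $w$ and offset $b$ assigns a target vector $y$ to a side via

$$\operatorname{side}(y) = \operatorname{sign}(w^\top y - b),$$

sending $y$ left when $w^\top y \le b$ and right otherwise.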

The optimal prediction value for each side of the split is then the
mean (centroid) of the vectors that fall on that side.
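To spell out why the centroid is optimal (a standard least-squares fact, stated here in my own notation): for the set $S$ of target vectors on one side, the constant prediction minimising the total squared Euclidean error is

$$\hat{y} \;=\; \arg\min_{c} \sum_{y_i \in S} \lVert y_i - c \rVert^2 \;=\; \frac{1}{|S|} \sum_{y_i \in S} y_i,$$

i.e. the mean of the vectors in $S$.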

Finding the optimal split point for a single output variable is
trivial and computationally efficient. For an n-tuple we cannot find
the optimal split (or at least it becomes computationally infeasible
as n increases), but we may be able to find a near-optimal split using
a Monte Carlo type method, or some hybrid of Monte Carlo sampling
and local gradient traversal; a sketch follows.
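As a rough illustration of that Monte Carlo idea (entirely my own sketch, assuming the squared-error-to-centroid criterion above; the function names are hypothetical), one could sample random hyperplanes in output space and keep the one with the lowest within-side error:

```python
import numpy as np

def split_cost(Y, mask):
    """Sum of squared distances of each side's vectors to that side's centroid."""
    cost = 0.0
    for side in (Y[mask], Y[~mask]):
        if len(side) > 0:
            cost += np.sum((side - side.mean(axis=0)) ** 2)
    return cost

def monte_carlo_split(Y, n_trials=200, rng=None):
    """Sample random hyperplanes (w, b); return the best (w, b, cost) found."""
    rng = np.random.default_rng(rng)
    n = Y.shape[1]
    best = (None, None, np.inf)
    for _ in range(n_trials):
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        b = float(w @ Y[rng.integers(len(Y))])  # pass the plane through a random point
        mask = Y @ w <= b
        if mask.all() or not mask.any():
            continue  # degenerate split, skip it
        cost = split_cost(Y, mask)
        if cost < best[2]:
            best = (w, b, cost)
    return best

Y = np.random.default_rng(0).normal(size=(100, 3))
w, b, cost = monte_carlo_split(Y)
print(cost)
```

Note this only scores candidate partitions of the target vectors, as described above; a practical tree would ultimately need each split to be expressible in terms of the input features as well.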

Would this actually work, or would it just memorise the training
pairs without generalising? Does this technique already exist under a different name?

You might also want to consider how this relates to neural nets such as Restricted Boltzmann Machines (RBMs) and Deep Belief Networks.

Best Answer

Multiple-output decision trees (and hence random forests) have been developed and published. Pierre Geurts distributes a package for this. See also Segal, M. & Xiao, Y., "Multivariate random forests", WIREs Data Mining and Knowledge Discovery 1(1), 2011, 80–87, DOI: 10.1002/widm.12. I believe the latest version of scikit-learn also supports this (see the sketch below). A good review of the state of the art can be found in the thesis by Henrik Linusson entitled "Multi-Output Random Forests".

The simplest method for making the split choice at each node is to randomly choose ONE of the output variables and then follow the usual random forest approach for choosing a split. Other methods based on a weighted sum of the mutual information score with respect to each input feature and output variable have been developed, but they are quite expensive compared to the randomized approach.
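For the scikit-learn route, a minimal sketch (assuming a recent scikit-learn version; `RandomForestRegressor` accepts a 2-D target array natively):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Two correlated outputs stacked into one (n_samples, n_outputs) array.
Y = np.column_stack([X[:, 0] + X[:, 1],
                     X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=500)])

# A single forest fit on the multi-output target; each leaf stores a vector.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)
print(model.predict(X[:5]).shape)  # (5, 2): one 2-tuple prediction per sample
```

Contrast this with the one-forest-per-output sketch earlier: here a single set of trees models both outputs jointly, so correlated structure is learned once.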
