Solved – When to avoid Random Forest

classification, machine learning, random forest

Random forests are well known to perform fairly well on a variety of tasks and have been referred to as the leatherman of learning methods. Are there any types of problems or specific conditions in which one should avoid using a random forest?

Best Answer

Thinking about the specific language of the quotation, a leatherman is a multi-tool: a single piece of hardware with lots of little gizmos tucked into it. It's a pair of pliers, and a knife, and a screwdriver and more! Rather than having to carry each of these tools individually, the leatherman is a single item that you can clip to your trousers so it's always at hand. This is convenient, but the trade-off is that each of the component tools is not the best at its job. The can opener is hard to use, the screwdriver bits are usually the wrong size, and the knife can accomplish little more than whittling. If doing any of these tasks is critical, you'd be better served with a specialized tool: an actual knife, an actual screwdriver, or an actual pair of pliers.

A random forest can be thought of in the same terms. Random forest yields strong results on a variety of data sets and is not especially sensitive to tuning parameters, but it's not perfect. The more you know about the problem, the easier it is to build a specialized model that accommodates it.

There are a couple of obvious cases where random forests will struggle:

  • Sparsity - When the data are very sparse, it's very plausible that for some node, the bootstrapped sample and the random subset of features will collaborate to produce an invariant feature space. There's no productive split to be had, so it's unlikely that the children of this node will be at all helpful. XGBoost can do better in this context (a sketch comparing the two on sparse data follows this list).

  • Data are not axis-aligned - Suppose that there is a diagonal decision boundary in the space of two features, $x_1$ and $x_2$. Even if this is the only relevant dimension in your data, it will take an ordinary random forest model many splits to describe that diagonal boundary, because each split is oriented perpendicular to the axis of either $x_1$ or $x_2$. (This should be intuitive because an ordinary random forest model makes splits of the form $x_1 > 4$, i.e. a threshold on a single feature.) Rotation forest, which performs a PCA projection on the subset of features selected for each split, can be used to overcome this: the projections onto an orthogonal basis will, in principle, reduce the influence of the axis-aligned property, because the splits will no longer be axis-aligned in the original basis. A sketch after this list illustrates the effect of rotating the features.

    Another example of how axis-aligned splits shape random forest decisions: when the true decision boundary is a circle centered at the origin, a random forest tends to draw a box to approximate that circle. There are a number of things one could do to improve this boundary; the simplest include gathering more data and building more trees.

  • Random forests basically only work on tabular data, i.e. data where there is no strong, qualitatively important relationship among the features in the sense of the data being an image, or the observations being networked together on a graph. These structures are typically not well approximated by many rectangular partitions. If your data are a time series, a collection of images, a graph, or have some other obvious structure, a random forest will have a very hard time recognizing it. I have no doubt that researchers have developed variations on the method to attempt to accommodate these situations, but a vanilla random forest won't necessarily pick up on these structures in a helpful way. The good news is that you typically know when this is the case, i.e. you know you have images, a time series, or a graph to work with, so you can immediately apply a method more appropriate to that type of data.
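
To make the sparsity point concrete, here is a minimal sketch, not from the original answer, that fits a random forest and a gradient-boosted model on a synthetic, very sparse feature matrix. The matrix size, density, label construction, and hyperparameters are all assumptions chosen purely for illustration, and it assumes NumPy, SciPy, scikit-learn, and the xgboost package are installed; actual results will depend on the data.

```python
# Sketch: random forest vs. gradient boosting on very sparse data.
# Every quantity here (n, p, density, label construction) is an
# illustrative assumption, not part of the original answer.
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n, p = 2000, 1000
X = sparse.random(n, p, density=0.01, format="csr", random_state=0)  # ~1% nonzero

# The label depends on a linear score over the first 100 (sparse) columns,
# so most columns are pure noise and most entries are zero.
w = rng.normal(size=100)
y = (X[:, :100] @ w > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
gb = XGBClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("random forest accuracy:", accuracy_score(y_te, rf.predict(X_te)))
print("xgboost accuracy:      ", accuracy_score(y_te, gb.predict(X_te)))
```

Boosting will often have an edge on data like this, but the structural point matters more than any single accuracy number: at many nodes, the features the forest happens to sample can all be constant within the bootstrapped sample, so the split is wasted.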
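To illustrate the axis-aligned point, here is a second minimal sketch, again an illustration rather than code from the answer. It fits a single decision tree (the building block of a forest) on a perfectly diagonal boundary, first on the raw features and then on a hand-picked rotated feature $x_1 - x_2$; the hand-picked rotation stands in for the PCA step that rotation forest performs.

```python
# Sketch: a diagonal boundary needs a staircase of axis-aligned splits,
# but only a single split once the features are rotated. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))     # two features, x1 and x2
y = (X[:, 0] > X[:, 1]).astype(int)        # diagonal boundary: x1 > x2

# Raw, axis-aligned features: the tree approximates the diagonal with
# many perpendicular splits.
raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One rotated feature, x1 - x2: the boundary is now axis-aligned,
# so a single split separates the classes.
X_rot = (X[:, 0] - X[:, 1]).reshape(-1, 1)
rot_tree = DecisionTreeClassifier(random_state=0).fit(X_rot, y)

print("nodes using raw features:   ", raw_tree.tree_.node_count)
print("nodes using rotated feature:", rot_tree.tree_.node_count)
```

The node counts make the geometry visible: the tree on the raw features needs a large staircase of splits to trace the diagonal, while the tree on the rotated feature typically needs only a root split and two leaves.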