Solved – How to incorporate constraints in random forest output

cart, decision-theory, machine-learning, nonlinear-regression, random-forest

Suppose I am doing random forest classification with labels $A$, $B$, $C$, $D$. There is a theoretical ordering to this output: when $A$ is more likely than $B$, $B$ is also more likely than $C$, and so on. Likewise, if $P(D) > P(C)$, then also $P(C) > P(B) > P(A)$. There are other such conditions that must hold.

The issue is that an actual random forest may produce output that completely violates these constraints, even if it predicts the most likely outcome correctly. For my use case the ordering matters, because decisions are based on more than just the most likely outcome.

It also seems intuitive that I should be able to improve generalization if I can somehow encode this prior knowledge in the model.

How do I account for this in a decision forest? Despite this structure in the output, I do not think a real-valued response variable can be constructed: the labels are still classes with no natural real value, even if there is some ordering to them.
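To make the constraints concrete, here is a minimal sketch of a checker for the two implications stated above (the function name and the probability ordering $[P(A), P(B), P(C), P(D)]$ are illustrative, not part of any library):

```python
import numpy as np

def violates_ordering(p):
    """Check the constraints from the question for p = [P(A), P(B), P(C), P(D)].

    If P(A) > P(B), the whole vector must be strictly decreasing;
    if P(D) > P(C), the whole vector must be strictly increasing.
    Returns True when either implication is broken.
    """
    p = np.asarray(p, dtype=float)
    if p[0] > p[1] and not np.all(np.diff(p) < 0):
        return True
    if p[3] > p[2] and not np.all(np.diff(p) > 0):
        return True
    return False
```

A vector such as `[0.4, 0.1, 0.3, 0.2]` has $P(A) > P(B)$ but is not decreasing overall, so it violates the constraints, while a unimodal vector like `[0.2, 0.4, 0.3, 0.1]` triggers neither implication.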

Best Answer

Here is a possibility: you could add a constraint to the optimization of the purity index (e.g. Gini index or entropy) for the individual trees in the forest:
$$\min \sum_i D_i \quad \text{with} \quad D_i = 1 - \sum_{k} p_{ik}^2$$
$$\text{s.t.} \quad p_{ik} \ge p_{i(k-1)} \ge \dots \ge p_{i0},$$
where $k$ indexes the class, $i$ indexes the terminal node, and $p_{ik}$ is the proportion of class $k$ in node $i$. That way your forest should yield results consistent with the ordering as well. You could relax the condition by introducing slack variables $\zeta_{ik} \ge 0$ that are penalized in the objective, replacing the hard constraints with $p_{i(k-1)} - \zeta_{i(k-1)} \le p_{ik} - \zeta_{ik}$ and so on for the other probabilities.
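Off-the-shelf random forest implementations do not expose this constrained split criterion, but the spirit of the $p_{i0} \le \dots \le p_{ik}$ constraint can be approximated post hoc: project each predicted probability vector onto the non-decreasing cone with isotonic regression (pool-adjacent-violators) and renormalize. This is a cheap workaround sketch, not the constrained optimization itself; `IsotonicRegression` is scikit-learn's:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def project_to_monotone(probs):
    """Project each row of predicted class probabilities onto the
    non-decreasing cone (p_0 <= p_1 <= ... <= p_k), then renormalize
    so each row still sums to 1. A post-hoc surrogate for the
    constrained purity optimization, not an exact substitute."""
    iso = IsotonicRegression(increasing=True)
    x = np.arange(probs.shape[1])
    out = np.vstack([iso.fit_transform(x, row) for row in probs])
    return out / out.sum(axis=1, keepdims=True)
```

For example, projecting `[0.5, 0.1, 0.2, 0.2]` pools all four violating values into the uniform vector `[0.25, 0.25, 0.25, 0.25]`, which trivially satisfies the ordering.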

But if your data are correct and the condition really holds, your forest should already yield results consistent with it. If you fit an unconstrained forest with enough trees and you still do not observe $P(A) < P(B) < \dots$, it is quite likely that you are mixing non-comparable data sets or that the condition is simply not true.
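As a diagnostic along these lines, one can fit an unconstrained forest and measure how often its predicted probability vectors break the assumed ordering. The synthetic ordinal data and the unimodality check below are illustrative assumptions, not part of the original problem:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic ordinal problem: the class index increases with a latent score.
x = rng.normal(size=(2000, 3))
score = x @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=2000)
y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))  # labels 0..3

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(x, y)
probs = clf.predict_proba(x)

def is_unimodal(p):
    """True when p rises to its peak and falls afterwards -- a proxy
    for the ordering constraints described in the question."""
    peak = int(np.argmax(p))
    return (np.all(np.diff(p[: peak + 1]) >= 0)
            and np.all(np.diff(p[peak:]) <= 0))

viol = np.mean([not is_unimodal(p) for p in probs])
print(f"violation rate: {viol:.3f}")
```

A high violation rate on data where the ordering should hold is exactly the warning sign described above: either the data sets are not comparable or the assumed condition does not hold.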