Solved – How to incorporate constraints in random forest output

cart, decision-theory, machine-learning, nonlinear-regression, random-forest

Suppose I am doing random forest classification with labels $A$, $B$, $C$, $D$. There is a theoretical ordering to this output: when $A$ is more likely than $B$, $B$ is also more likely than $C$, and so on. Likewise, if $P(D) > P(C)$, then also $P(C) > P(B) > P(A)$. There are other such conditions that must hold.

The issue is that an actual random forest may produce output that completely violates these constraints, even if it predicts the most likely outcome correctly. For my use case the ordering matters, because decisions are based on more than just the most likely outcome.

It also seems intuitive that I should be able to improve generalization if I can somehow encode this prior knowledge in the model.

How do I account for this in a decision forest? Despite this structure in the output, I do not think a real-valued response variable can be constructed: the labels are still classes with no natural real value, even if there is some ordering to them.
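To make the constraints concrete, here is a minimal sketch of a checker for the two implications stated above (the function name and the probability ordering $[P(A), P(B), P(C), P(D)]$ are illustrative, not part of any library):

```python
import numpy as np

def violates_ordering(p):
    """Check the constraints from the question for p = [P(A), P(B), P(C), P(D)].

    If P(A) > P(B), the whole vector must be strictly decreasing;
    if P(D) > P(C), the whole vector must be strictly increasing.
    Returns True when either implication is broken.
    """
    p = np.asarray(p, dtype=float)
    if p[0] > p[1] and not np.all(np.diff(p) < 0):
        return True
    if p[3] > p[2] and not np.all(np.diff(p) > 0):
        return True
    return False
```

A vector such as `[0.4, 0.1, 0.3, 0.2]` has $P(A) > P(B)$ but is not decreasing overall, so it violates the constraints, while a unimodal vector like `[0.2, 0.4, 0.3, 0.1]` triggers neither implication.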

Best Answer

Here is a possibility: you could add a constraint to the optimization of the purity index (e.g. Gini index or entropy) for the individual trees in the forest:
$$\min \sum_i D_i \quad \text{with} \quad D_i = 1 - \sum_{k} p_{ik}^2$$
$$\text{s.t.} \quad p_{ik} \ge p_{i(k-1)} \ge \dots \ge p_{i0},$$
where $k$ indexes the class, $i$ indexes the terminal node, and $p_{ik}$ is the proportion of class $k$ in node $i$. That way your forest should yield results consistent with the ordering as well. You could relax the condition by introducing slack variables $\zeta_{ik} \ge 0$ that are penalized in the objective, replacing the hard constraints with $p_{i(k-1)} - \zeta_{i(k-1)} \le p_{ik} - \zeta_{ik}$ and so on for the other probabilities.
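Off-the-shelf random forest implementations do not expose this constrained split criterion, but the spirit of the $p_{i0} \le \dots \le p_{ik}$ constraint can be approximated post hoc: project each predicted probability vector onto the non-decreasing cone with isotonic regression (pool-adjacent-violators) and renormalize. This is a cheap workaround sketch, not the constrained optimization itself; `IsotonicRegression` is scikit-learn's:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def project_to_monotone(probs):
    """Project each row of predicted class probabilities onto the
    non-decreasing cone (p_0 <= p_1 <= ... <= p_k), then renormalize
    so each row still sums to 1. A post-hoc surrogate for the
    constrained purity optimization, not an exact substitute."""
    iso = IsotonicRegression(increasing=True)
    x = np.arange(probs.shape[1])
    out = np.vstack([iso.fit_transform(x, row) for row in probs])
    return out / out.sum(axis=1, keepdims=True)
```

For example, projecting `[0.5, 0.1, 0.2, 0.2]` pools all four violating values into the uniform vector `[0.25, 0.25, 0.25, 0.25]`, which trivially satisfies the ordering.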

But if your data are correct and the condition really holds, your forest should already yield results consistent with it. If you fit an unconstrained forest with enough trees and you still do not observe $P(A) < P(B) < \dots$, it is quite likely that you are mixing non-comparable data sets or that the condition is simply not true.
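As a diagnostic along these lines, one can fit an unconstrained forest and measure how often its predicted probability vectors break the assumed ordering. The synthetic ordinal data and the unimodality check below are illustrative assumptions, not part of the original problem:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic ordinal problem: the class index increases with a latent score.
x = rng.normal(size=(2000, 3))
score = x @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=2000)
y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))  # labels 0..3

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(x, y)
probs = clf.predict_proba(x)

def is_unimodal(p):
    """True when p rises to its peak and falls afterwards -- a proxy
    for the ordering constraints described in the question."""
    peak = int(np.argmax(p))
    return (np.all(np.diff(p[: peak + 1]) >= 0)
            and np.all(np.diff(p[peak:]) <= 0))

viol = np.mean([not is_unimodal(p) for p in probs])
print(f"violation rate: {viol:.3f}")
```

A high violation rate on data where the ordering should hold is exactly the warning sign described above: either the data sets are not comparable or the assumed condition does not hold.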