Solved – Find max value of random forest regressor output

random forest · regression · scikit-learn

I was wondering, for scikit-learn's regressors (ExtraTreesRegressor, RandomForestRegressor, etc.), how can I find the combination of inputs that would give me the max value of the target variable?

Other than brute forcing all possible inputs, is there a smarter way? There's no documentation on this in scikit-learn and I would like to know more 🙂

Thank you!

EDIT:

@gurg, what I mean is, say I have

import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

train = pd.read_csv(blah)
dataset, target = splitDatasetTarget(train)  # split into feature data and target
clf = ExtraTreesRegressor()
clf.fit(dataset, target)  # let's ignore the train/test split part

Now I have a model that, given a certain input (row), predicts an output (value). How would I be able to find for which values of the inputs I get the maximum value of my target variable? (Assuming the target variable is finitely bounded.)

The only way I can think of is randomly trying different combinations with:

clf.predict(data)

and then seeing which input data gives me the largest output.
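
Something like this is what I have in mind (a rough sketch; the number of samples is arbitrary and I just sample within the ranges seen in my training data):

import numpy as np
import pandas as pd

# crude random search: sample candidate inputs inside the observed feature ranges
rng = np.random.default_rng(0)
low, high = dataset.min().values, dataset.max().values
candidates = pd.DataFrame(
    rng.uniform(low, high, size=(100_000, dataset.shape[1])),
    columns=dataset.columns,
)

preds = clf.predict(candidates)
best_input = candidates.iloc[[np.argmax(preds)]]  # input giving the largest predicted value
print(preds.max())

but obviously this gives no guarantee of finding the true maximum.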

Best Answer

This answer covers random forest regression, and it can be extended to probabilistic classification, where there is one maximal prediction for each class probability. It covers most typical random forest implementations (extraTrees included), but not the more exotic ones such as randomUniformForest or rotationForest.

The maximal training set prediction is very likely to lie inside the global maxima region of the model structure. I trained a randomForest (R) on a training set of 1000 observations and predicted both a test set of 10^6 observations and the training set itself. No test prediction was higher than the maximum of the training set predictions.
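
In the asker's scikit-learn setup, the corresponding candidate is simply the training row with the highest prediction (a minimal sketch reusing clf and dataset from the question):

import numpy as np

train_preds = clf.predict(dataset)
best_idx = int(np.argmax(train_preds))
best_input = dataset.iloc[[best_idx]]  # candidate input for the maximum prediction
print(train_preds[best_idx])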

But not all unique observed test prediction values are observed among the train predictions. Tree models split the feature space into regions and split those regions recursively into sub-regions. Each sub-region is shaped like a line segment/rectangle/box/hyper-box and is assigned one prediction across its entire sub-space. A forest is many overlaid tree region masks, which cut the hyper-boxes into even smaller regions. In contrast to single tree models, not every region has to be occupied by a training set prediction. Check the simulated example below and notice the empty regions along the diagonal between the point clusters in the $(x_1, x_2)$ plane.

[Figure: 3D surface of the forest prediction. The x- and y-axes are the features $x_1$ and $x_2$; the z-axis is the forest prediction $\hat{y}$.]

(Note: the image is only a 3D slice of the 6+1-dimensional model structure. But because regions are shaped like hyper-boxes, a region that appears pointless (literally) in this slice is in fact empty.)

set.seed(1)
library(randomForest)
library(forestFloor)
X = data.frame(replicate(6, rnorm(1000)))   # 1000 obs, 6 features
X[1:500, ] = X[1:500, ] + 4                 # shift half the points to form a second cluster
y = 1/(X[,1]^2 + X[,2]^2 + 0.1) + 1/((X[,1]-4)^2 + (X[,2]-4)^2 + 0.1) + rnorm(1000)
rf = randomForest(X, y, ntree=100)
vec.plot(rf, X, 1:2, grid.lines=200, col="red", zoom=2)  # prediction surface over x1, x2

Thus I was unable to prove that the training set prediction is in fact always the global maximum of the structure, as some regions of a forest will contain no training set observations. But I find it very likely to be the case. A million observations confirming this against infinitely many possible combinations is of course not a lot. But I cannot [see edit below] imagine a structure where such a diagonal, pointless region would in fact be the global maxima region.

For single trees, the maximal training prediction is of course the global maximum of the model, as any terminal node is defined by training set observations.
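
For a single scikit-learn tree this is easy to check: the largest leaf value is the largest prediction the tree can ever output, and it is attained by some training row (a standalone sketch with synthetic data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

leaves = tree.tree_.children_left == -1          # boolean mask of leaf nodes
max_leaf_value = tree.tree_.value[leaves].max()  # largest value the tree can ever predict
print(max_leaf_value, tree.predict(X).max())     # identical for a single tree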

I call it a maxima region because the region contains a set of points with equal, maximal predictions.

Next you may want to find the exact boundaries of the maxima region. Some outer regions will be unbounded towards infinity in one or more dimensions, but that is still some kind of answer. For every tree, you have to trace the path of the one training observation with the maximal prediction and collect all the splits along it. Every split is defined by a given feature and a break point (categorical break points are possible as well). Every split yields an inequality, something like:

$x_1 \le 5.2$, if the training observation had a value less than or equal to 5.2,

otherwise $x_1>5.2$

and for categorical splits (not natively supported by sklearn) $x_1 \in \{\text{blue}, \text{purple}, \text{pink}\}$,

which is the same as $x_1 \not\in \{\text{red}, \text{green}\}$.

Lastly, for each feature remove the redundant boundary inequalities: if $x_1 \le 5.2$, we don't care to know that $x_1 \le 7.2$. You should end up with one or two inequalities per numerical feature, and these are the exact description of your maxima region.
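
Translated to scikit-learn (which exposes each tree's structure through tree_.feature, tree_.threshold and decision_path), a rough standalone sketch of this procedure on synthetic data could look as follows; it is just one way of collecting and pruning the split inequalities, not a definitive implementation:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# fit a forest, find the training row with the highest prediction, then collect
# the split inequalities along that row's path in every tree of the forest
X, y = make_regression(n_samples=500, n_features=6, random_state=0)
clf = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

best_row = X[np.argmax(clf.predict(X))].reshape(1, -1)

lower = np.full(X.shape[1], -np.inf)   # running bound: x_j >  lower[j]
upper = np.full(X.shape[1],  np.inf)   # running bound: x_j <= upper[j]

for est in clf.estimators_:
    tree = est.tree_
    node_ids = est.decision_path(best_row).indices   # nodes visited by best_row in this tree
    for node in node_ids:
        if tree.children_left[node] == tree.children_right[node]:
            continue                                  # leaf: no split to record
        feat, thr = tree.feature[node], tree.threshold[node]
        if best_row[0, feat] <= thr:                  # went left: x_feat <= threshold
            upper[feat] = min(upper[feat], thr)
        else:                                         # went right: x_feat > threshold
            lower[feat] = max(lower[feat], thr)

# only the tightest (non-redundant) bounds per feature remain
for j in range(X.shape[1]):
    print(f"feature {j}: {lower[j]:.3f} < x_{j} <= {upper[j]:.3f}")

The printed intervals describe the hyper-box of the forest that contains the maximal training prediction; as argued above, this is a likely, but not guaranteed, global maximum.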

[Edit: #!¤/!, I can actually disprove my idea. But the example is quite devilish, taking the 'diagonal effect' described above to the extreme. So there are at least extreme cases where test set predictions are more extreme than training set predictions.] Imagine a 10D feature space with real axes and a center at (0,0,0,0,0,0,0,0,0,0). I place normally distributed clusters of 500 points each at +5 and -5 on every axis, thus 20 clusters in all. Points closest to the center of a cluster have higher true target values. Any point distant from all clusters has a low true target value, approaching zero at large distances from any cluster. Hereby I create an N=10,000, p=10 training set and train a randomForest model on it. The center of the feature space, although relatively far from any single cluster, gets a small increased prediction from each of them. As the center is a neighbour to all 20 clusters, the model structure actually has pointless maxima regions somewhere close to the center, and these exceed any local maxima near the clusters and any region containing training observations.

Hereby, in the simulation below, I find some test predictions that are higher than any training prediction.

set.seed(1)
library(randomForest)
library(forestFloor)

# create training set with 20 clusters (10 at +5 and 10 at -5, one per axis)
X1 = data.frame(replicate(10, rnorm(5000)))
for(i in 0:9) {
  rows = (i*500+1):((i+1)*500)
  X1[rows, i+1] = X1[rows, i+1] + 5
}
X2 = data.frame(replicate(10, rnorm(5000)))
for(i in 0:9) {
  rows = (i*500+1):((i+1)*500)
  X2[rows, i+1] = X2[rows, i+1] - 5
}
X = rbind(X1, X2)

# target equal to the sum of inverse distances to the clusters
# I use a 2D 'disc' distance function: the discs of the clusters have
# orthogonal orientations, but each is oriented along one dimension
# (the i'th axis) towards the center.
y = apply(sapply(1:10, function(i) {
  if(i == 10) j = 1 else j = i + 1
  1 / ((X[,i] - 5)^2 + (X[,j])^2 + 0.01) +
  1 / ((X[,i] + 5)^2 + (X[,j])^2 + 0.01)
}), 1, sum)

rf = randomForest(X, y, ntree=20)
trainPred = predict(rf, X)
range(trainPred)   # range of 10,000 train predictions
# [1] 0.8077388 2.2239506

Xtest = data.frame(replicate(10, rnorm(500000, sd=4)))
testPred = predict(rf, Xtest)
range(testPred)    # range of 500,000 test predictions
# [1] 0.8433307 2.4964615