This is easy with scipy.optimize.fmin, by evaluating your model on sampled points, or by rolling your own optimization routine. Optimization is the bread and butter of how many models work, so it's worth learning about in detail. See, for instance, these lecture notes.
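As a minimal sketch of the first option, assuming a fitted scikit-learn regressor (the toy data, model, and starting point below are illustrative assumptions, not from the question):

import numpy as np
from scipy.optimize import fmin
from sklearn.ensemble import RandomForestRegressor

# toy data with a single bump, as a stand-in for the real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1 / (X[:, 0]**2 + X[:, 1]**2 + 0.1)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# fmin minimizes, so negate the prediction to maximize it
def neg_prediction(x):
    return -model.predict(x.reshape(1, -1))[0]

x_max = fmin(neg_prediction, x0=X.mean(axis=0), disp=False)

Note that fmin is a local, derivative-free optimizer and a forest's prediction surface is piecewise constant, so in practice you would restart it from several sampled points and keep the best result.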
This probably isn't what you want, though. The $r^2$ you cross-validated holds over the distribution of the input space. The $\operatorname*{arg\,max}_x f(x)$ that the random forest finds will probably lie in a low-confidence neighborhood, away from the support of the training data.
You want a model that generates confidence intervals. In scikit-learn, GaussianProcess and GradientBoostingRegressor both do this. Gaussian processes are excellent for this problem as long as you have no more than a thousand or so training observations.
If you can collect more data after consulting your model, then evaluate your black box at the $\arg\max$ of the upper confidence bound, add the result as a new data point, refit, and repeat until the maximizer stops changing.
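A minimal sketch of that loop, assuming scikit-learn's GaussianProcessRegressor; the black_box function, the candidate grid, and kappa are illustrative assumptions:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def black_box(x):                      # the expensive function to optimize (assumed)
    return float(-(x - 0.7)**2)

X = np.array([[0.0], [0.5], [1.0]])    # initial design points
y = np.array([black_box(x) for x in X.ravel()])
candidates = np.linspace(-1.0, 2.0, 500).reshape(-1, 1)
kappa = 2.0                            # width of the confidence bound

for _ in range(20):
    # alpha adds jitter so near-duplicate points don't break the fit
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    i = np.argmax(mean + kappa * std)  # UCB acquisition: optimistic maximizer
    X = np.vstack([X, candidates[i:i + 1]])
    y = np.append(y, black_box(candidates[i, 0]))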
This problem is known as the Contextual Bandit. The choice of confidence bound governs the trade-off between exploration (converging on the true maximum) and exploitation (getting good intermediate values). Since you don't care how poorly the model performs while training, you'd pick a large upper confidence bound so the model converges faster.
There are several related toolkits for this sort of problem, such as Spearmint, Hyperopt, and MOE.
If you cannot collect more data, then you should take the $\arg\max$ of the lower confidence bound of the model instead. This penalizes predictions that the model is uncertain about.
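Continuing the sketch above, the pessimistic maximizer is just the other side of the same bound:

mean, std = gp.predict(candidates, return_std=True)
x_best = candidates[np.argmax(mean - kappa * std)]  # lower confidence bound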
This answer covers random forest regression, and it can be extended to probabilistic classification, where there is one maximal prediction per class probability. It covers most typical random forest implementations (extraTrees included), but not more exotic variants such as randomUniformForest or rotationForest.
The maximal training-set prediction is very likely to be part of the global maxima region of the model structure. I trained a randomForest (in R) on a training set of 1000 observations and predicted both that training set and a test set of $10^6$ observations. No test prediction was higher than the maximum of the training-set predictions.
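For reference, a hedged Python port of that experiment (the original used R's randomForest; the data-generating function here is an assumption, modeled on the simulation further below):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 6))
y_train = 1 / (X_train[:, 0]**2 + X_train[:, 1]**2 + 0.1) + rng.normal(size=1000)
rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

X_test = rng.normal(size=(10**6, 6))
print(rf.predict(X_train).max())  # maximal train prediction
print(rf.predict(X_test).max())   # compare: typically not higher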
But not all unique observed test-prediction values occur among the train predictions. Tree models split the feature space into regions, and split those regions recursively into sub-regions. Each sub-region is shaped like a line segment/rectangle/box/hyper-box and is assigned one prediction over its entire sub-space. A forest overlays the region masks of many trees, cutting the hyper-boxes into even smaller regions. Unlike a single tree, not every region of a forest has to be occupied by a training-set prediction. Check this simulated example and notice the empty regions on the diagonal between the point clusters in the $(x_1, x_2)$ plane.
The x- and y-axes are the features $x_1$ and $x_2$; the z-axis is the forest prediction $\hat{y}$.
(Note: the image is only a 3D slice of the 6+1-dimensional model structure. But because regions are shaped like hyper-boxes, a region that looks pointless (literally: free of training points) in this slice really is so.)
set.seed(1)
library(randomForest)
library(forestFloor)

# 1000 observations, 6 features; shift the first 500 rows to +4 on all axes,
# giving two point clusters
X = data.frame(replicate(6, rnorm(1000)))
X[1:500, ] = X[1:500, ] + 4

# target: two bumps at the cluster centers in the (x1, x2) plane, plus noise
y = 1/(X[,1]^2 + X[,2]^2 + .1) + 1/((X[,1]-4)^2 + (X[,2]-4)^2 + 0.1) + rnorm(1000)

rf = randomForest(X, y, ntree=100)

# plot the forest's prediction surface over features 1 and 2
vec.plot(rf, X, 1:2, grid.lines=200, col="red", zoom=2)
Thus I was unable to prove that the maximal training-set prediction is in fact always the global maximum of the structure, as some regions of a forest contain no training-set observations. But I find it very likely to be the case. A million observations confirming it against infinitely many possible configurations is of course not much, but I cannot imagine a structure where such a diagonal, pointless region would in fact be the global maximum region.
For single trees, the maximal training prediction is of course the global maximum of the model, as any terminal node is defined by training set observations.
I call them maxima regions because each such region contains a whole set of points with the same maximal prediction.
Next you may want to find the exact boundaries of the maxima region. Some outer regions will be unbounded towards infinity in one or more dimensions, but that is still some kind of answer. For every tree, pass the training observation with the maximal prediction down the tree and collect all the splits along its path. Every split is defined by a feature and a break point (categorical break points are possible too). Each split yields an inequality, something like:
$x_1 \le 5.2$, if the training observation had a value less than or equal to 5.2,
otherwise $x_1 > 5.2$;
and for categorical splits (not natively supported by sklearn), $x_1 \in \{\text{blue}, \text{purple}, \text{pink}\}$, which is the same as $x_1 \notin \{\text{red}, \text{green}\}$.
Lastly, for each feature you remove the redundant boundary inequalities: if $x_1 \le 5.2$, we don't care to know that $x_1 \le 7.2$ as well. You should end up with one or two inequalities per numerical feature, and these are the exact description of your maxima region.
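A hedged sketch of this split collection for a single sklearn tree (this thread's own code uses R's randomForest, so this port is an assumption; X and y are assumed to be numpy arrays):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor().fit(X, y)
x_star = X[np.argmax(tree.predict(X))]       # max-prediction training point

node_ids = tree.decision_path(x_star.reshape(1, -1)).indices
bounds = {}                                  # feature -> [lower, upper]
for node in node_ids[:-1]:                   # the last id is the leaf itself
    f = tree.tree_.feature[node]
    t = tree.tree_.threshold[node]
    lo, hi = bounds.get(f, [-np.inf, np.inf])
    if x_star[f] <= t:                       # went left: x_f <= t
        hi = min(hi, t)
    else:                                    # went right: x_f > t
        lo = max(lo, t)
    bounds[f] = [lo, hi]                     # redundant splits collapse here
# bounds now holds the non-redundant inequalities defining the leaf's hyper-box

For a whole forest, repeat this per tree and intersect the resulting boxes.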
[edit, #!¤/! I can actually disprove my idea. But the example is quite devilish, taking the 'diagonal effect' described above to the extreme. So there are at least extreme cases where test-set predictions are more extreme than training-set predictions.] Imagine a 10D feature space with real axes and its center at $(0,0,\ldots,0)$. I place normally distributed clusters of 500 points each at +5 and -5 on every axis, 20 clusters in all. Points closest to the center of a cluster have higher true target values; any point far from every cluster has a low true target value, approaching zero at large distances. Hereby I create an N = 10,000, p = 10 training set and train a randomForest model on it. The center of the feature space, although relatively far from every single cluster, gets a small raised prediction from each of them. Since the center neighbors all 20 clusters, the model structure actually has pointless maxima regions somewhere near the center, and these exceed any local maximum near the clusters and any region containing training observations.
Hence, in the simulation below, I find test predictions that are higher than any train prediction.
set.seed(1)
library(randomForest)
library(forestFloor)

# create training set with 20 clusters of 500 points each:
# cluster i of X1 is shifted to +5 on axis i, cluster i of X2 to -5
X1 = data.frame(replicate(10, rnorm(5000)))
for(i in 0:9) {
  X1[i*500 + 1:500, i+1] = X1[i*500 + 1:500, i+1] + 5
}
X2 = data.frame(replicate(10, rnorm(5000)))
for(i in 0:9) {
  X2[i*500 + 1:500, i+1] = X2[i*500 + 1:500, i+1] - 5
}
X = rbind(X1, X2)
# target equal to the sum of inverse distances to the clusters,
# via a 2D 'disc' distance function: the discs of the clusters have
# orthogonal orientations, each lying in the plane of its own axis (i)
# and the next axis (j), oriented towards the center
y = apply(sapply(1:10, function(i) {
  if(i==10) j=1 else j=i+1
  1 / ((X[,i]-5)^2 + (X[,j])^2 + 0.01) +
  1 / ((X[,i]+5)^2 + (X[,j])^2 + 0.01)
}), 1, sum)
rf = randomForest(X, y, ntree=20)

trainPred = predict(rf, X)
range(trainPred)  # range of 10,000 train predictions
# 0.8077388 2.2239506

Xtest = data.frame(replicate(10, rnorm(500000, sd=4)))
testPred = predict(rf, Xtest)
range(testPred)   # range of 500,000 test predictions
# 0.8433307 2.4964615  <- higher than any train prediction
Best Answer
Use the Classifier. No, they are not both valid.
First, I really encourage you to read up on the topic of regression vs. classification. Using ML without knowing anything about it will give you wrong results that you won't even notice, and that's quite dangerous (it's a little like asking which way around you should hold your gun, or whether it matters at all).
Whether you use a classifier or a regressor depends only on the kind of problem you are solving. You have a binary classification problem, so use the classifier.
NO. You don't get probabilities from a regressor. It just tries to "extrapolate" the values you give it (here only 0 and 1). This means values above 1 or below 0 are perfectly valid regression outputs, since regression does not expect just two discrete output values (that's called classification!) but continuous ones.
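A tiny illustration of that point, with an assumed toy dataset and a plain linear regressor:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])            # binary labels fed to a regressor
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))           # about [1.9]: well outside [0, 1]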
If you want the "probabilities" (be aware that these need not be well-calibrated probabilities) that a certain point belongs to a certain class, train a classifier (so it learns to classify the data) and then use .predict_proba(), which predicts the probability.
Just to mention it here: .predict vs. .predict_proba (for a classifier!):
.predict effectively takes the .predict_proba output and maps everything below a certain threshold (usually 0.5) to 0 and everything above it to 1.
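A short sketch of that difference, on an assumed toy dataset:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # binary labels
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])   # shape (5, 2): P(class 0), P(class 1)
labels = clf.predict(X[:5])        # hard 0/1 labels
# for a binary problem this matches thresholding P(class 1) at 0.5
assert np.array_equal(labels, (proba[:, 1] > 0.5).astype(int))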
Remark: sure, internally they can be very much the same apart from the "last layer" etc. Still, think of them (or better, of the problems they solve) as completely different!