Solved – Posterior probabilities with decision trees or decision forests

bayesianboostingcalibrationcartposterior

Is there a way to get posterior probabilities $P(C | \vec{x})$ (probability that a data item $\vec{x}$ belong to one of the given classes) in a multiclass classification problem using decision trees or forests?

I found some hints about using calibration methods (e.g. Platt's method or Isotonic Regression) in combination with boosted or bagged trees. However, as I'm not experienced in this field I can't find a good explanation how this works. It would be very helpful if one of you could explain me the general idea how to get posterior probabilities with decision trees or a good link or paper where these things are explained.

Best Answer

This is explained in section 2.2.4 of the paper Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning.

In the case of classification each leaf may store the empirical distribution over the classes associated to the subset of training data that has reached that leaf.

During testing, each tree leaf yields a distribution over classes and the forest output is the average of these leaf distributions.

Related Solutions

Solved – Fully grown decision trees in random forests

Yes, even a single A-outlier (sample of class A) placed in the middle of many B examples (in a feature space) would affect the structure of the forest. The trained forest model will predict new samples as A, when these are placed very close to the A-outlier. But the density of neighboring B examples will decrease size of this "predict A"-island in the "predict B"-ocean. But the "predict A"-island will not disappear. For noisy classification, e.g. Contraceptive Method Choice, the default random forest can be improved by lowering tried varaibles in each split(mtry) and bootstrap sample size(sampsize). If sampsize is e.g. 40% of training size, then any 'single sample prediction island' will completely drown if surrounded by only counter examples, as it only will be present in 40% of the trees.

EDIT: If sample replacement is true, then more like 33% of trees.

mean(replicate(1000,length(unique(sample(1:1000,400,rep=T)))))

I made a simulation of the problem (A=TRUE,B=FALSE) where one (A/TRUE)sample is injected within many (B/FALSE) samples. Hereby is created a tiny A-island in the B ocean. The area of the A-island is so small, it has no influence on the overall prediction performance. Lowering sample size makes the island disappear.

1000 samples with two features $X_1$ and $X_2$ are of class "true/A" if $y_i = X_1 + X_2 >= 0$. Features are drawn from $N(0,1)$

library(randomForest)
par(mfrow=c(2,2))
set.seed(1)

#make data
X = data.frame(replicate(2,rnorm(1000)))
y = factor(apply(X,1,sum) >=0) #create 2D class problem
X[1,] = c(-1,-1); y[1]='TRUE' #insert one TRUE outlier inside 'FALSE-land'

#train default forest
rf = randomForest(X,y)

#make test grid(250X250) from -3 to 3,
Xtest = expand.grid(replicate(2,seq(-3,3,le=250),simplify = FALSE))
Xtest[1,] = c(-1,-1) #insert the exact same coordinate of train outlier
Xtest = data.frame(Xtest); names(Xtest) = c("X1","X2")
plot(Xtest,col=predict(rf,Xtest),pch=15,cex=0.5,main="regular RF, red=TRUE")

#zoom in on area surrouding outlier
Xtest = expand.grid(replicate(2,seq(-1.05,-.95,le=250),simplify = FALSE))
Xtest = data.frame(Xtest); names(Xtest) = c("X1","X2")
plot(Xtest,col=predict(rf,Xtest),pch=15,cex=0.5,main= "regular RF, zoom, red=TRUE")

#train extra robust RF
rf = randomForest(X,y,sampsize = 400)
Xtest = expand.grid(replicate(2,seq(-3,3,le=250),simplify = FALSE))
Xtest[1,] = c(-1,-1)
Xtest = data.frame(Xtest); names(Xtest) = c("X1","X2")
plot(Xtest,col=predict(rf,Xtest),pch=15,cex=0.5,main="RF, sampsize=400, red=TRUE")

Xtest = expand.grid(replicate(2,seq(-1.05,-.95,le=250),simplify = FALSE))
Xtest = data.frame(Xtest); names(Xtest) = c("X1","X2")
plot(Xtest,col=predict(rf,Xtest),pch=15,cex=0.5,main="RF, sampsize=400, zoom, red=TRUE")

Machine Learning Calibration – When Should a Classifier Predicting Probability Be Calibrated?

There are several possible scenarios when one would think about calibrating probabilities:

The model is misspecified or not optimally trained. That will be the case when non-linear relationships are modeled with a linear learner; or model is too rigid due to excessive regularization (model underfits); or to the contrary, the model is too flexible (overfit or data memorization). Under/over-fit may also be due to having too few/many learning epochs or bagged trees.
A wrong objective function for predicting probabilities was chosen. That will be the case for distorted probabilities predicted by sklearn RandomForest, where they use "gini" or "entropy" for objective function. Classifiers with logloss as objective function are supposed to produce non-biased probability estimates given they have enough data to learn from. Onesidedness of probabilities may explain probability distortions near interval end, but that would not explain distortions in the middle.
Using optimized objective function instead of exact one. This is the case with XGBoost Random Forest implementation:

XGBoost uses 2nd order approximation to the objective function. This can lead to results that differ from a random forest implementation that uses the exact value of the objective function.

In all three scenarios (including having too little data, where it's unclear if calibration results will generalize well when new data arrives), calibration time is better spent on (1) correct model specification (2) choosing right metric (objective function) to optimize (3) collecting more data.

Case 1. Right objective, enough data (Logistic Regression from sklearn)

Case 2. Wrong objective (Random Forest from sklearn)

Case 3. Right objective, needs more data (Random Forest from XGBoost)

Case 4. Right objective, needs more data

There is nothing wrong with the "wrong" metric as all the classifiers, including sklearn's Random Forest, Naïve Bayes, or SVC, perform very well for certain tasks. The expectation these will behave well for predicting probabilities is wrong and misspecified.
Calibrating well specified and well trained models, even though may show better in-sample result, most probably is fitting to test and will lead to worse generalization to new data.

Best Answer

Related Solutions

Solved – Fully grown decision trees in random forests

Machine Learning Calibration – When Should a Classifier Predicting Probability Be Calibrated?

Related Question