In the unsupervised case, randomForest produces a proximity matrix that you can use for clustering.
library(randomForest)
# unsupervised forest: no response variable, so only the proximity matrix matters
g <- randomForest(iris[,-5], keep.forest=FALSE, proximity=TRUE)
# two-dimensional MDS embedding of the proximity matrix, colored by species
mds <- MDSplot(g, iris$Species, k=2, pch=16, palette=c("skyblue", "orange", "darkblue"))
library(cluster)
# PAM clustering on 1 - proximity, treated as a dissimilarity matrix
clusters_pam <- pam(1-g$proximity, k=3, diss = TRUE)
# plotting symbol encodes the PAM cluster, color encodes the true species
plot(mds$points[, 1], mds$points[, 2], pch=clusters_pam$clustering+14, col=c("skyblue", "orange", "darkblue")[as.numeric(iris$Species)])
legend("bottomleft", legend=unique(clusters_pam$clustering), pch = 15:17, title = "PAM cluster")
legend("topleft", legend=unique(iris$Species), pch = 16, col=c("skyblue", "orange", "darkblue"), title = "Iris species")
MDS stands for Multi-dimensional Scaling.
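If you want the coordinates without the plot, essentially the same embedding can be computed directly with cmdscale on 1 - proximity; this is roughly what MDSplot does internally (a minimal sketch):
# sketch: classical MDS on the proximity-derived dissimilarities,
# roughly what MDSplot computes before drawing the plot
d <- 1 - g$proximity
mds_manual <- cmdscale(d, eig = TRUE, k = 2)
head(mds_manual$points)  # coordinates of the same kind as mds$points above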
Of course the clusters won't map one-to-one onto the original classes (that's why I deliberately didn't remap the clusters, so this is not a confusion matrix):
table(clusters_pam$clustering, iris$Species)
    setosa versicolor virginica
  1     50          0         0
  2      0          9        42
  3      0         41         8
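If you did want something closer to a confusion matrix, one illustrative sketch would be to relabel each cluster by its majority species first:
# sketch: relabel each PAM cluster with its majority species, then cross-tabulate
tab <- table(clusters_pam$clustering, iris$Species)
majority <- colnames(tab)[apply(tab, 1, which.max)]  # majority species per cluster
table(majority[clusters_pam$clustering], iris$Species)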
(Figure: two-dimensional MDS plot produced by the code above.)
Then you can use your clusters as classes to train a supervised model:
g_new <- randomForest(x=iris[,-5], y=as.factor(clusters_pam$clustering), keep.forest=TRUE, proximity=TRUE)
table(predict(g_new, iris[,-5]), clusters_pam$clustering)
     1  2  3
  1 50  0  0
  2  0 51  0
  3  0  0 49
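The supervised forest can then label any new observation with the same four features; for example, with a single made-up measurement (the values here are hypothetical, just to show the call):
# sketch: predict the cluster of one hypothetical new flower
new_flower <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                         Petal.Length = 4.7, Petal.Width = 1.4)
predict(g_new, new_flower)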
For the sake of our example, and because the Iris dataset is so small, we generate a simulated Iris dataset:
library(semiArtificial) # to generate dummy data for testing
# create tree ensemble generator for classification problem
irisGenerator <- treeEnsemble(Species~., iris, noTrees=100)
# use the generator to create new data
irisNew <- newdata(irisGenerator, size=200)
Now we can predict on the new dataset and check how well the predictions agree with the simulated dataset's species class:
table(predict(g_new, irisNew[,-5]), irisNew$Species)
    setosa versicolor virginica
  1     66          1         4
  2      1          7        56
  3      5         55         5
To predict probabilities:
predict(g_new, irisNew[,-5], type="prob")
       1     2     3
1  1.000 0.000 0.000
2  0.014 0.002 0.984
3  0.000 0.000 1.000
4  1.000 0.000 0.000
5  0.020 0.068 0.912
6  0.000 1.000 0.000
7  1.000 0.000 0.000
8  0.480 0.000 0.520
9  0.526 0.000 0.474
10 1.000 0.000 0.000
Just build the tree so that the leaves contain not just a single class estimate but a probability estimate as well. This can be done simply by running any standard decision tree algorithm, then passing a bunch of data through it and counting what fraction of the samples in each leaf belongs to each class; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
If you don't set max_depth or similar parameters, the tree will keep splitting until every leaf is pure, and all the predicted probabilities will be 1 (as Soren says).
Note that this does not make the tree nondeterministic: given an input, it deterministically produces both a class prediction and a confidence score in the form of a probability.
Verification that this is what's happening:
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.cross_validation import train_test_split
>>> X, y = make_classification(n_informative=10, n_samples=1000)
>>> Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
>>> clf = DecisionTreeClassifier(max_depth=5)
>>> clf.fit(Xtrain, ytrain)
>>> clf.predict_proba(Xtest[:5])
array([[ 0.19607843,  0.80392157],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.02631579,  0.97368421],
       [ 0.9017094 ,  0.0982906 ]])
>>> from sklearn.utils import check_array
>>> from sklearn.tree.tree import DTYPE
>>> def get_node(X):
... return clf.tree_.apply(check_array(X, dtype=DTYPE))
...
>>> node_idx, = get_node(Xtest[:1])
>>> ytrain[get_node(Xtrain) == node_idx].mean()
0.80392156862745101
(In the not-yet-released sklearn 0.17, this get_node helper can be replaced by clf.apply.)
Best Answer
It's just the proportion of votes of the trees in the ensemble.
Alternatively, if you multiply your probabilities by ntree, you get the same result, but now in counts instead of proportions.
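A quick way to see this with the forest from the example above (a small sketch reusing g_new and irisNew): request the raw votes and compare them with the probabilities multiplied by the number of trees.
# sketch: probabilities are vote fractions, so fractions * ntree recover raw vote counts
p <- predict(g_new, irisNew[, -5], type = "prob")                      # vote fractions
v <- predict(g_new, irisNew[, -5], type = "vote", norm.votes = FALSE)  # raw vote counts
head(p * g_new$ntree)  # matches head(v): each tree casts one vote per observation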