Solved – How does `predict.randomForest` estimate class probabilities

classificationpredictionrrandom forest

How does randomForest package estimate class probabilities when I use predict(model, data, type = "prob")?

I was using ranger for training random forests using the probability = T argument to predict probabilities. ranger says in documentation that it:

Grow a probability forest as in Malley et al. (2012).

I simulated some data and tried both packages and obtained very different results (see code below)

enter image description here

So I know that it uses a different technique (then ranger) to estimate probabilities. But which one?

simulate_data <- function(n){
  X <- data.frame(matrix(runif(n*10), ncol = 10))
  Y <- data.frame(Y = rbinom(n, size = 1, prob = apply(X, 1, sum) %>%
                               pnorm(mean = 5)
                             ) %>% 
                    as.factor()

  ) 
  dplyr::bind_cols(X, Y)
}

treino <- simulate_data(10000)
teste <- simulate_data(10000)

library(ranger)
modelo_ranger <- ranger(Y ~., data = treino, 
                                num.trees = 100, 
                                mtry = floor(sqrt(10)), 
                                write.forest = T, 
                                min.node.size = 100, 
                                probability = T
                                )

modelo_randomForest <- randomForest(Y ~., data = treino,
                                    ntree = 100, 
                                    mtry = floor(sqrt(10)),
                                    nodesize = 100
                                    )

pred_ranger <- predict(modelo_ranger, teste)$predictions[,1]
pred_randomForest <- predict(modelo_randomForest, teste, type = "prob")[,2]
prob_real <- apply(teste[,1:10], 1, sum) %>% pnorm(mean = 5)

data.frame(prob_real, pred_ranger, pred_randomForest) %>%
  tidyr::gather(pacote, prob, -prob_real) %>%
  ggplot(aes(x = prob, y = prob_real)) + geom_point(size = 0.1) + facet_wrap(~pacote)

Best Answer

It's just the proportion of votes of the trees in the ensemble.

library(randomForest)

rf = randomForest(Species~., data = iris, norm.votes = TRUE, proximity = TRUE)
p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = TRUE)

identical(p1,p2)
#[1] TRUE

Alternatively, if you multiply your probabilities by ntree, you get the same result, but now in counts instead of proportions.

p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = FALSE)

identical(500*p1,p2)
#[1] TRUE
Related Question