Solved – `randomForest` predict() for continuous variable: unexpected output

Tags: cart, machine learning, r, random forest

I am trying to understand how predict() for a randomForest() fit in R computes predicted values for a continuous y. My understanding is that, for a single tree and observation i, it should average over all observations falling in the same terminal node as i, possibly excluding observation i itself. Is this correct?

However, when I do this manually in R, I don't get the same result. In fact, the rule that reproduces predict() seems to change from case to case:

  • i=1, t=1: seems to remove the last observation in the node, not i
  • i=2, t=1: seems to remove all observations except i
  • i=1, t=2: seems to remove i (what I expect)

I don't understand how this averaging within a tree works.

Example with i=1, t=1

Example with data(swiss), predicting first observation (Courtelary), from first tree:

  • my code, all obs in same node: 81.5
  • my code, all obs in same node, except i: 81.93333
  • predict(): returns 82.8
  • my code, all obs in same node, except last: 82.8

Code:

library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

data(swiss)

## run forest
set.seed(111)
swiss.rf <- randomForest(Fertility ~ ., data=swiss)

## predict first obs, from first tree only:
i <- 1
t <-  1
pred_i_t <- predict(swiss.rf, newdata =swiss[i,], predict.all=TRUE)$individual[t]

## get node values, extract obs in that node
nodes_tree1 <- attr(predict(swiss.rf, newdata =swiss, nodes = TRUE), "nodes")[,t]
same_node <- swiss[nodes_tree1 == nodes_tree1[i],]

same_node
#>            Fertility Agriculture Examination Education Catholic
#> Courtelary      80.2        17.0          15        12     9.96
#> Moutier         85.8        36.5          12         7    33.77
#> Gruyere         82.4        53.3          12         7    97.67
#> Val de Ruz      77.6        37.6          15         7     4.97
#>            Infant.Mortality
#> Courtelary             22.2
#> Moutier                20.3
#> Gruyere                21.0
#> Val de Ruz             20.0

mean(same_node$Fertility)
#> [1] 81.5
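## drop observation i itself, matched by row name (i indexes swiss, not same_node)
mean(same_node$Fertility[rownames(same_node) != rownames(swiss)[i]])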
#> [1] 81.93333
mean(same_node$Fertility[-nrow(same_node)])
#> [1] 82.8
pred_i_t
#> [1] 82.8

Created on 2018-10-24 by the reprex package (v0.2.1)

Best Answer

I found the issue: randomForest also bootstraps (bags) the observations, so each tree's node average has to be taken over the resampled observations, not over the original sample.
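
To make the bagging step concrete, here is a minimal base-R sketch of what one bootstrap sample looks like. It only mimics the idea: the seed and the names `draw` and `inbag_counts` are illustrative assumptions, and it does not reproduce the draws randomForest makes internally.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
n <- 47                                       # number of rows in the swiss data
draw <- sample(n, size = n, replace = TRUE)   # one bootstrap sample of row indices
inbag_counts <- tabulate(draw, nbins = n)     # how often each row was drawn

sum(inbag_counts)          # equals n: every draw is counted exactly once
mean(inbag_counts == 0)    # share of rows never drawn (out-of-bag) in this sample
(1 - 1/n)^n                # expected out-of-bag share, close to exp(-1)
```

Rows with a count of 0 play no part in the node averages of that tree, and rows drawn more than once count more than once.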

So the averaging is done over the observations that fall in the node within that tree's bootstrap sample. The in-bag information is stored when the forest is fitted with the keep.inbag=TRUE argument. For the case i=1, t=1, the in-bag values of the four observations in the node are (1,1,1,0), i.e. the last observation was not drawn in that bootstrap sample, which explains the result.
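
Written out for this node, using the Fertility values from the table in the question and the in-bag values (1,1,1,0) as weights, the arithmetic is as follows. The symbols $w_j$ (in-bag value of observation $j$), $y_j$ (its Fertility) and $\hat{y}_{i,t}$ (prediction for observation $i$ from tree $t$) are notation introduced here, not taken from the package documentation.

```latex
\hat{y}_{i,t}
  = \frac{\sum_{j \in \text{node}} w_j \, y_j}{\sum_{j \in \text{node}} w_j}
  = \frac{1(80.2) + 1(85.8) + 1(82.4) + 0(77.6)}{1 + 1 + 1 + 0}
  = \frac{248.4}{3}
  = 82.8
```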

See code:

library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

data(swiss)

## run forest
set.seed(111)
swiss.rf <- randomForest(Fertility ~ ., data=swiss, keep.inbag=TRUE)


## predict first obs, from first tree only:
i <- 1
t <-  1
pred_i_t <- predict(swiss.rf, newdata =swiss[i,], predict.all=TRUE)$individual[t]



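## get the terminal node of every observation in tree t
nodes_tree1 <- attr(predict(swiss.rf, newdata =swiss, nodes = TRUE), "nodes")[,t]
## in-bag values of every observation for tree t (0 = not drawn in that bootstrap sample)
bag_tree1 <- swiss.rf$inbag[, t]

swiss$node <- nodes_tree1
swiss$bag <- bag_tree1

## observations sharing the terminal node of observation i
obs_i_tree_t <- subset(swiss, node == nodes_tree1[i])

## average weighted by the in-bag values, so out-of-bag observations get weight 0
with(obs_i_tree_t, weighted.mean(Fertility, bag))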
#> [1] 82.8
pred_i_t
#> [1] 82.8

Created on 2018-10-26 by the reprex package (v0.2.1)
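
The same check can be run for every observation and several trees at once, which also covers the cases i=2, t=1 and i=1, t=2 from the question. This is a sketch under one assumption: that `inbag` stores the number of times each observation was drawn for each tree, as it appears to in version 4.6-14. The helper `manual_pred` and the choice of the first five trees are illustrative, not part of the package.

```r
library(randomForest)

data(swiss)
set.seed(111)
rf <- randomForest(Fertility ~ ., data = swiss, keep.inbag = TRUE)

# Per-tree predictions and terminal-node ids, both n x ntree matrices
pred_all <- predict(rf, newdata = swiss, predict.all = TRUE)$individual
nodes    <- attr(predict(rf, newdata = swiss, nodes = TRUE), "nodes")

# Hypothetical helper: in-bag weighted mean of Fertility over the
# observations that share the terminal node of observation i in tree t
manual_pred <- function(i, t) {
  same_node <- nodes[, t] == nodes[i, t]
  weighted.mean(swiss$Fertility[same_node], rf$inbag[same_node, t])
}

# Manual predictions for all observations in the first five trees
trees  <- 1:5
manual <- sapply(trees, function(t) {
  sapply(seq_len(nrow(swiss)), manual_pred, t = t)
})

# Compare with predict(); all.equal() allows for floating-point noise
check <- all.equal(unname(manual), unname(pred_all[, trees]))
check
```

If the assumption about `inbag` holds, `check` should be TRUE. A mismatch would suggest that the installed version stores in-bag membership differently.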