Solved – randomForest predict() for continuous variable: unexpected output

cart, machine learning, r, random forest

I am trying to understand how predict() for a randomForest object in R computes the predicted values for a continuous y. My understanding is that, for a single tree and observation i, it should average over all observations falling in the same node as i, possibly excluding observation i itself. Is this correct?

However, when I do this manually in R, I don't get the same result. Which observations appear to be excluded actually changes from case to case:

• i=1, t = 1: seems to remove last one, not i
• i=2, t = 1: seems to remove all but i
• i=1, t = 2: seems to remove i (what I expect)

I don't understand how this averaging within the trees works.

Example with i=1, t=1

Example with data(swiss), predicting first observation (Courtelary), from first tree:

• my code, all obs in same node: 81.5
• my code, all obs in same node, except i: 81.93333
• predict(): returns 82.8
• my code, all obs in same node, except last: 82.8
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

data(swiss)

## run forest
set.seed(111)
swiss.rf <- randomForest(Fertility ~ ., data=swiss)

## predict first obs, from first tree only:
i <- 1
t <-  1
pred_i_t <- predict(swiss.rf, newdata =swiss[i,], predict.all=TRUE)$individual[t]

## get node values, extract obs in that node
nodes_tree1 <- attr(predict(swiss.rf, newdata =swiss, nodes = TRUE), "nodes")[,t]
same_node <- swiss[nodes_tree1 == nodes_tree1[i],]
same_node
#>              Fertility Agriculture Examination Education Catholic
#> Courtelary        80.2        17.0          15        12     9.96
#> Moutier           85.8        36.5          12         7    33.77
#> Gruyere           82.4        53.3          12         7    97.67
#> Val de Ruz        77.6        37.6          15         7     4.97
#>              Infant.Mortality
#> Courtelary               22.2
#> Moutier                  20.3
#> Gruyere                  21.0
#> Val de Ruz               20.0

mean(same_node$Fertility)
#> [1] 81.5
mean(same_node$Fertility[-i])
#> [1] 81.93333
mean(same_node$Fertility[-nrow(same_node)])
#> [1] 82.8
pred_i_t
#> [1] 82.8


Created on 2018-10-24 by the reprex package (v0.2.1)

I found the issue: randomForest also bootstraps (bags) the observations for each tree, so one needs to average over the resampled observations, not over the original sample.

So the averaging is done over the observations in the node that belong to that tree's bootstrap sample. The in-bag counts can be retrieved by fitting with the keep.inbag=TRUE argument. For the case i=1, t=1, the counts for the four observations in the node are (1,1,1,0), i.e. the last observation was not drawn in that specific sample, which explains the result!
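Written out, the per-tree prediction is an in-bag weighted mean. The notation here is introduced only for illustration and is not taken from the randomForest documentation: N_t(i) is the set of training observations in the same terminal node as observation i in tree t, b_{jt} is the number of times observation j was drawn into the bootstrap sample of tree t (the entry inbag[j, t]), and y_j is the response. The worked numbers are the four Fertility values and the counts (1,1,1,0) from the case i=1, t=1:

```latex
\hat{y}_t(x_i) = \frac{\sum_{j \in N_t(i)} b_{jt}\, y_j}{\sum_{j \in N_t(i)} b_{jt}},
\qquad
\hat{y}_1(x_1) = \frac{1(80.2) + 1(85.8) + 1(82.4) + 0(77.6)}{1 + 1 + 1 + 0}
               = \frac{248.4}{3} = 82.8
```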

See code:

library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

data(swiss)

## run forest
set.seed(111)
swiss.rf <- randomForest(Fertility ~ ., data=swiss, keep.inbag=TRUE)

## predict first obs, from first tree only:
i <- 1
t <-  1
pred_i_t <- predict(swiss.rf, newdata =swiss[i,], predict.all=TRUE)$individual[t]

## get node values, extract obs in that node
nodes_tree1 <- attr(predict(swiss.rf, newdata =swiss, nodes = TRUE), "nodes")[,t]
bag_tree1 <- swiss.rf$inbag

swiss$node <- nodes_tree1
swiss$bag <-  swiss.rf$inbag[, t]

obs_i_tree_t <- subset(swiss, node == nodes_tree1[i])

with(obs_i_tree_t, weighted.mean(Fertility, bag))
#> [1] 82.8
pred_i_t
#> [1] 82.8

Created on 2018-10-26 by the reprex package (v0.2.1)
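
The same comparison can be repeated for every tree rather than only t=1. The following is a minimal sketch, not a definitive implementation: it assumes that inbag stores draw counts per observation and tree (as it appears to in the 4.6-14 version used above), and the names rf, p, pred, nodes and manual are chosen here purely for illustration.

```r
library(randomForest)

data(swiss)
set.seed(111)
rf <- randomForest(Fertility ~ ., data = swiss, keep.inbag = TRUE)

## per-tree predictions (n x ntree) and terminal node ids (n x ntree)
p     <- predict(rf, newdata = swiss, predict.all = TRUE)
pred  <- p$individual
nodes <- attr(predict(rf, newdata = swiss, nodes = TRUE), "nodes")

i <- 1

## for each tree: in-bag weighted mean of Fertility over the node containing obs i
## (assumption: rf$inbag[j, t] is the number of times obs j was drawn for tree t)
manual <- sapply(seq_len(rf$ntree), function(t) {
  same <- nodes[, t] == nodes[i, t]
  weighted.mean(swiss$Fertility[same], rf$inbag[same, t])
})

## should be TRUE if every tree predicts the in-bag node mean
all.equal(manual, as.numeric(pred[i, ]))

## the forest prediction is then the plain average over the trees
all.equal(mean(pred[i, ]), as.numeric(p$aggregate[i]))
```

If both checks hold, the prediction for observation i is the in-bag node mean within each tree, averaged with equal weight across trees.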