[GIS] Random Forest out-of-bag error and confusion matrix

classification, confusion matrix, r, random forest, remote sensing

I have implemented a Random Forest classifier in R to classify remote sensing data. The original code comes from here:

How to perform Random Forest land classification?

Everything works fine, but I need to obtain the confusion matrix and the out-of-bag (OOB) error for a quick assessment of classification accuracy. I have created a junk dataset (only 2 classes) to check that the script itself works, and I added parallel computing to speed up the computation.

I am wondering whether my implementation is correct and whether I will still be able to obtain the confusion matrix and the out-of-bag error once I add many more features describing the classes. Here is part of my code:

sdata <- readOGR(dsn = vyber_shp, layer = "trenink")

# extraction from raster to be classified

rdata <- data.frame(extract(r, sdata))

# create random forest model in parallel

cl <- makeCluster(detectCores())

registerDoParallel(cl)

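# each of the 4 tasks grows ntree = 125 trees (500 in total after combine);
# %dopar% must stay on the same line as foreach() for the expression to parse
rf.mdl <- foreach(ntree=rep(125, 4), .combine = combine, .packages = 'randomForest') %dopar% {
    randomForest(x=rdata, y=sdata$Class, ntree=ntree,
                 proximity=TRUE, importance=TRUE, confusion=TRUE,
                 do.trace=TRUE, err.rate=TRUE)
}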

stopCluster(cl)

# classify raster in parallel

beginCluster()

predikce <- clusterR(r, raster::predict, args=list(model=rf.mdl))

endCluster()

# write and save result of classification as raster image

klasifikace <- writeRaster(predikce, filename='Klasifikace_RandomForest', 
format='HFA', options='INTERLEAVE=BSQ', datatype='INT2S', overwrite=TRUE)

varImpPlot(rf.mdl)

rf.mdl$confusion

Best Answer

The combine function (used as .combine in foreach) does not carry the confusion matrix and OOB error components over into the final randomForest object. See ?randomForest::combine:

The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test component, if exist) of the combined object will be NULL.
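This behaviour can be checked directly. The following is a minimal sketch, assuming the randomForest package is installed and using iris as stand-in data; the object names rf_single, rf_a, rf_b and rf_comb are made up for illustration.

```r
library(randomForest)

set.seed(1)
x <- subset(iris, select = -Species)
y <- iris$Species

# A single forest keeps its OOB bookkeeping.
rf_single <- randomForest(x = x, y = y, ntree = 100)
is.null(rf_single$confusion)   # FALSE
is.null(rf_single$err.rate)    # FALSE

# Two forests merged with combine() lose these components.
rf_a <- randomForest(x = x, y = y, ntree = 50, norm.votes = FALSE)
rf_b <- randomForest(x = x, y = y, ntree = 50, norm.votes = FALSE)
rf_comb <- combine(rf_a, rf_b)
is.null(rf_comb$confusion)     # TRUE
is.null(rf_comb$err.rate)      # TRUE
rf_comb$ntree                  # 100
```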

But the predict method returns OOB predictions if the newdata argument is omitted:

library("randomForest")  # combine() must be visible in the master session
library("doSNOW")
library("foreach")

cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)

set.seed(1)
x <- subset(iris, select=-Species)
y <- iris$Species

rf <- foreach(ntree=rep(250, 4), .combine=combine, .packages="randomForest") %dopar%
      randomForest(x=x, y=y, ntree=ntree, norm.votes=FALSE)

stopCluster(cl)  # release the workers once the forest is built

rf
# Call:
#  randomForest(x = x, y = y, ntree = ntree, norm.votes = FALSE) 
#                Type of random forest: classification
#                      Number of trees: 1000
# No. of variables tried at each split: 2

# `predict.randomForest` returns OOB predictions if `newdata` is not given
rf_pred <- predict(rf)

caret::confusionMatrix(rf_pred, y)
# Confusion Matrix and Statistics
#
#             Reference
# Prediction   setosa versicolor virginica
#   setosa         50          0         0
#   versicolor      0         47         3
#   virginica       0          3        47
#
# Overall Statistics
#
#                Accuracy : 0.96
#                  95% CI : (0.915, 0.985)
#     No Information Rate : 0.333
#     P-Value [Acc > NIR] : <2e-16
#
#                   Kappa : 0.94
#  Mcnemar's Test P-Value : NA
#
# Statistics by Class:
#
#                      Class: setosa Class: versicolor Class: virginica
# Sensitivity                  1.000             0.940            0.940
# Specificity                  1.000             0.970            0.970
# Pos Pred Value               1.000             0.940            0.940
# Neg Pred Value               1.000             0.970            0.970
# Prevalence                   0.333             0.333            0.333
# Detection Rate               0.333             0.313            0.313
# Detection Prevalence         0.333             0.333            0.333
# Balanced Accuracy            1.000             0.955            0.955
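If caret is not available, the confusion matrix and the OOB error can also be tabulated from the OOB predictions with base R alone. This is only a sketch: the helper name oob_summary is hypothetical, and the toy labels below are made up purely to show the arithmetic.

```r
# Hypothetical helper: compare OOB predictions with the reference labels.
oob_summary <- function(pred, obs) {
  stopifnot(length(pred) == length(obs))
  obs  <- factor(obs)
  pred <- factor(pred, levels = levels(obs))
  keep <- !is.na(pred)  # skip samples that have no OOB prediction
  cm <- table(Prediction = pred[keep], Reference = obs[keep])
  list(
    confusion   = cm,
    oob_error   = 1 - sum(diag(cm)) / sum(cm),  # overall misclassification rate
    class_error = 1 - diag(cm) / colSums(cm)    # error per reference class
  )
}

# Toy labels, invented for illustration only
obs  <- factor(c("a", "a", "a", "b", "b", "b"))
pred <- factor(c("a", "a", "b", "b", "b", NA), levels = levels(obs))

res <- oob_summary(pred, obs)
res$confusion
res$oob_error    # 1 of the 5 scored samples is wrong
```

With the combined forest from the answer above, the call would be oob_summary(predict(rf), y).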
