[GIS] Confusion Matrix in R

Tags: accuracy, confusion matrix, r, random forest, raster

I'm having some trouble understanding the result of my confusion matrix. Here is my case:

I've run a classification (random forest) on a satellite image. To do so, I created 50 random points for training and 50 random points for validation for each class. There are 6 classes in total. The code I used to create the points for each one is:

# load required packages
library(dismo)   # randomPoints()
library(sp)      # SpatialPoints(), SpatialPointsDataFrame()
library(raster)  # crs()

# create 100 random points (here for class B)
pointsB<-randomPoints(myraster, 100)

# add projection
pointsB<-SpatialPoints(pointsB, crs("+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"))

# create a df with an ID for each point
newpoints<-data.frame(ID=1:100)

# attach the IDs to the points
pBdf<-SpatialPointsDataFrame(pointsB, newpoints)

# split df into training (IDs 1-50) and validation (IDs 51-100)
trainingB<-pBdf[1:50,]
testB<-pBdf[51:100,]
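
Since these steps are repeated identically for each of the six classes, they could also be wrapped in a small helper function. This is only a sketch; make_points and rasterB are illustrative names (not from the original code), and the packages are assumed to be loaded as above:

# hypothetical helper: sample n points from one class raster and
# split them 50/50 into training and validation sets
make_points <- function(class_raster, n = 100) {
  # set.seed(1)  # uncomment for reproducible sampling
  pts <- randomPoints(class_raster, n)
  pts <- SpatialPoints(pts, crs("+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"))
  df  <- SpatialPointsDataFrame(pts, data.frame(ID = 1:n))
  list(training = df[1:(n/2), ], test = df[(n/2 + 1):n, ])
}

# e.g. for class B, where rasterB is a stand-in for that class's raster
pointsB   <- make_points(rasterB)
trainingB <- pointsB$training
testB     <- pointsB$test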

Once the point dataset is created for each class, I merge them:

trainlist<-list(trainingB, trainingR, trainingS, trainingBu, trainingO, trainingW)
trainingpoints<-do.call("rbind", trainlist)

testlist<-list(testB, testR, testS, testBu, testO, testW)
testpoints<-do.call("rbind", testlist)

The output for testpoints is:

class       : SpatialPointsDataFrame 
features    : 300 
extent      : 379895, 390455, 6166685, 6173075  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84     
+towgs84=0,0,0 
variables   : 1
names       :  ID 
min values  :  51 
max values  : 100 

After the points were created and my classification finished, here is how I created my confusion matrix:

library(caret)   # confusionMatrix()

# extract the classified value at each test point
prediction<-extract(classification_raster, testpoints)
prediction<-unlist(prediction)
predictiontable<-as.data.frame(prediction)

# using the same test points, extract the pixel values from the reference data
test<-extract(raster_Referecence, testpoints)
test<-unlist(test)
testtable<-as.data.frame(test)

# confusionMatrix() expects factors with matching levels
confusionMatrix(data=factor(predictiontable$prediction, levels=1:6),
                reference=factor(testtable$test, levels=1:6))
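
A quick way to double-check the class counts before interpreting the matrix is to tabulate the extracted values directly (a generic sanity check, not part of the original code):

# reference values: should show 50 pixels per class if the sampling is correct
table(testtable$test)

# predicted values: these are the row sums of the confusion matrix
table(predictiontable$prediction)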

When I check predictiontable and testtable, there are 50 points per class; however, the confusion matrix output is:

          Reference
Prediction  1  2  3  4  5  6
         1 43  4  0  1  0  9
         2  4 28  5  0 20  6
         3  0  2 44  0  0  0
         4  2  1  0 49  0  3
         5  0 14  1  0 31  0
         6  1  1  0  0  0 31

As you can see, some classes have only 33 points and others have 57. Shouldn't each row sum to 50?

Any idea?

Best Answer

The tl;dr answer to your question: no.

Long answer:

The 33 to 57 counts are your row sums, and those are your model's results. Notice that your column sums do add up to 50 per class (except the last two, but I assume you've made a transposition error somewhere; 49 and 51 are close enough).
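
You can check this directly in R by rebuilding the matrix from your output and summing along each dimension (illustrative only):

cm <- matrix(c(43,  4,  0,  1,  0,  9,
                4, 28,  5,  0, 20,  6,
                0,  2, 44,  0,  0,  0,
                2,  1,  0, 49,  0,  3,
                0, 14,  1,  0, 31,  0,
                1,  1,  0,  0,  0, 31),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = 1:6, Reference = 1:6))

rowSums(cm)  # model predictions per class: 57 63 46 55 46 33
colSums(cm)  # reference points per class:  50 50 50 50 51 49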

This means that, as you stated, you took a sample of 50 points per class at locations with known class identity, so you have 50 units of reference data for each class. You've compared this with your model's prediction for the same units of data. If your model were perfect, with 100% accuracy and precision, your row sums and column sums would all be 50, and only the main diagonal would be populated. But this is the real world, so your model is confusing some classes.

Let's look at how well your model does at predicting class two. Your model predicted that 63 of the 300 points were class two, so overall it overestimated the amount of class two you should be finding. However, it found only 28 of the 50 points a perfect model would have found. So not only is it overestimating class two, it also lacks precision in finding class two.
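
In remote-sensing terms, those two numbers are the producer's accuracy (recall) and the user's accuracy (precision) for class two. Continuing with the illustrative cm matrix from above:

# producer's accuracy (recall): of the 50 true class-2 points,
# how many did the model label correctly?
cm[2, 2] / colSums(cm)[2]   # 28 / 50 = 0.56

# user's accuracy (precision): of the 63 points predicted as class 2,
# how many really are class 2?
cm[2, 2] / rowSums(cm)[2]   # 28 / 63 ~ 0.44

# overall accuracy: diagonal (correct) over all points
sum(diag(cm)) / sum(cm)     # 226 / 300 ~ 0.75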

In summary, your results look fine; you just needed a bit of help with the interpretation. You should also figure out why you don't have exactly 50 reference points in classes 5 and 6. Other than that, this looks exactly like what you should expect from the kind of classification you are conducting.