[GIS] How to perform Random Forest land cover classification

land-classificationmachine learningrrandom forest

This is a follow-up to a previous post: Machine Learning Algorithms for Land Cover Classification.

It seems that the Random Forest (RF) classification method is gaining much momentum in the remote sensing world. I am particularly interested in RF due to many of its strengths:

  • A nonparametric approach suited to remote sensing data
  • High reported classification accuracy
  • Variable importance is reported

Given these strengths, I would like to perform Random Forest land classification using high resolution 4 band imagery. There is a lot of material and research touting the advantages of Random Forest, yet very little information exists on how to actually perform the classification analysis. I am familiar with RF regression using R and would prefer to use this environment to run the RF classification algorithm.

How do I collect, process and input training data (i.e. based on high resolution CIR aerial imagery) into the Random Forest algorithm using R? Any step-wise advice on how to produce a classified land cover raster would be greatly appreciated.

Best Answer

I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (i..e, QGIS, GRASS). Ideally you would have field data to train your classification.

The procedure for classification using Random Forests is fairly straight forward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack, assign the raster values to your training points using raster:extract and then pass this to randomForest. You will need to coerce your "class" column into a factor to have RF recognize the model as a classification instance. Once you have a fit model you can use the predict function, passing it you raster stack. You will need to pass the standard arguments to predict in addition to ones specific to the raster predict function. The raster package has the ability to handle rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments in the raster predict function is "filename" allowing for a raster to written to disk. For a multiclass problem you will need to set type="response" and index=1 which will output an integer raster of your classes.

There are a few caveats that should be noted:

  1. You cannot have more than 32 levels in your response variable (y) or any factor on the right side of the equation (x)
  2. Your classes must be balanced. A 30% rule is a good one to follow, that is if you have more than 30% more observations on one class than any other your problem becomes imbalanced and the results can be biased
  3. It is a misnomer that RF cannot overfit. If you over correlate your ensemble you can overfit the model. A good way to avoid this is to run a preliminary model and plot the error stabilization. As a rule of thumb, I choose 2X the number of bootstraps required to stabilize the error for the ntree parameter. This is because variable interaction stabilizes at a slower rate than error. If you are not including many variables in the model you can be much more conservative with this parameter.
  4. Do not use node purity as a measure of variable importance. It is not permuted like the mean decrease in accuracy.

I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.

Here is some simple code to get you started.

require(sp)
require(rgdal)
require(raster)
require(randomForest)

# CREATE LIST OF RASTERS
rlist=list.files(getwd(), pattern="img$", full.names=TRUE) 

# CREATE RASTER STACK
xvars <- stack(rlist)      

# READ POINT SHAPEFILE TRAINING DATA
sdata <- readOGR(dsn=getwd() layer=inshape)

# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
  sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])

# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)], y=as.factor(sdata@data[,"train"]),
                       ntree=501, importance=TRUE)

# CHECK ERROR CONVERGENCE
plot(rf.mdl)

# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
varImpPlot(rf.mdl, type=1)

# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response", 
        index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
Related Question