[GIS] How to perform Random Forest land cover classification

land-classification, machine learning, r, random forest

This is a follow-up to a previous post: Machine Learning Algorithms for Land Cover Classification.

It seems that the Random Forest (RF) classification method is gaining much momentum in the remote sensing world. I am particularly interested in RF due to many of its strengths:

A nonparametric approach suited to remote sensing data
High reported classification accuracy
Variable importance is reported

Given these strengths, I would like to perform Random Forest land classification using high resolution 4 band imagery. There is a lot of material and research touting the advantages of Random Forest, yet very little information exists on how to actually perform the classification analysis. I am familiar with RF regression using R and would prefer to use this environment to run the RF classification algorithm.

How do I collect, process and input training data (i.e. based on high resolution CIR aerial imagery) into the Random Forest algorithm using R? Any step-wise advice on how to produce a classified land cover raster would be greatly appreciated.

Best Answer

I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (e.g., QGIS, GRASS). Ideally you would have field data to train your classification.

The procedure for classification using Random Forests is fairly straightforward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack, assign the raster values to your training points using raster::extract and then pass this to randomForest. You will need to coerce your "class" column into a factor to have RF recognize the model as a classification instance. Once you have a fit model you can use the predict function, passing it your raster stack. You will need to pass the standard arguments to predict in addition to ones specific to the raster predict function. The raster package has the ability to handle rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments in the raster predict function is "filename", allowing for a raster to be written to disk. For a multiclass problem you will need to set type="response" and index=1, which will output an integer raster of your classes.
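
The point about coercing the class column to a factor can be illustrated with a small sketch. The training table here is synthetic and the column names (b1 to b4, class) are made up for illustration; only the randomForest call itself comes from the package.

```r
library(randomForest)

set.seed(42)
# Hypothetical training table: four band values plus an integer class code
train <- data.frame(
  b1 = runif(120), b2 = runif(120), b3 = runif(120), b4 = runif(120),
  class = rep(1:3, each = 40)
)
# Shift band 4 by class so the classes are separable (illustration only)
train$b4 <- train$b4 + train$class

bands <- c("b1", "b2", "b3", "b4")

# Numeric response: a regression forest is fit. randomForest may warn about
# a response with very few unique values, so the warning is suppressed here.
rf.reg <- suppressWarnings(
  randomForest(x = train[, bands], y = train$class, ntree = 101)
)

# Factor response: a classification forest is fit
rf.cls <- randomForest(x = train[, bands], y = as.factor(train$class),
                       ntree = 101)

print(rf.reg$type)
print(rf.cls$type)
```

Checking the type element of the fitted object is a quick way to confirm that you got a classification model before predicting to a raster.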

There are a few caveats that should be noted:

You cannot have more than 32 levels in your response variable (y) or any factor on the right side of the equation (x)
Your classes must be balanced. A 30% rule is a good one to follow: if you have more than 30% more observations in one class than in any other, your problem becomes imbalanced and the results can be biased
It is a misconception that RF cannot overfit. If you over-correlate your ensemble you can overfit the model. A good way to avoid this is to run a preliminary model and plot the error stabilization. As a rule of thumb, I choose 2X the number of bootstraps required to stabilize the error for the ntree parameter. This is because variable interaction stabilizes at a slower rate than error. If you are not including many variables in the model you can be much more conservative with this parameter.
Do not use node purity as a measure of variable importance. It is not permuted like the mean decrease in accuracy.
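
The caveats above can be checked with a few lines of R. This is only a sketch on synthetic data: the class names, the 0.01 tolerance and the way the stabilization point is located are my own assumptions for illustration, not functions from rfUtilities.

```r
library(randomForest)

set.seed(1)
# Hypothetical training table with deliberately imbalanced classes
n.per.class <- c(water = 60, forest = 50, urban = 30)
train <- data.frame(
  class = factor(rep(names(n.per.class), times = n.per.class)),
  b1 = rnorm(sum(n.per.class)),
  b2 = rnorm(sum(n.per.class))
)
# Shift band 1 by class so there is some signal (illustration only)
train$b1 <- train$b1 + 2 * as.integer(train$class)

# 1. CLASS BALANCE: ratio of the largest to the smallest class
counts <- table(train$class)
imbalance <- max(counts) / min(counts)
if (imbalance > 1.3) message("Largest class exceeds smallest by more than 30%")

# 2. ERROR STABILIZATION: OOB error as a function of the number of trees
rf <- randomForest(class ~ ., data = train, ntree = 501, importance = TRUE)
oob <- rf$err.rate[, "OOB"]
tol <- 0.01                          # assumed tolerance, adjust to taste
final <- oob[length(oob)]
unstable <- which(abs(oob - final) > tol)
n.stable <- if (length(unstable) == 0) 1 else max(unstable) + 1
ntree.suggested <- 2 * n.stable      # the 2X rule of thumb

# 3. VARIABLE IMPORTANCE: permuted mean decrease in accuracy, not node purity
imp <- importance(rf, type = 1)
print(imp)
```

Plotting oob against the tree index gives the same picture as plot(rf) for the OOB line, and n.stable is just a crude numeric reading of where that curve flattens.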

I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.

Here is some simple code to get you started.

require(sp)
require(rgdal)
require(raster)
require(randomForest)

# CREATE LIST OF RASTERS
rlist=list.files(getwd(), pattern="img$", full.names=TRUE) 

# CREATE RASTER STACK
xvars <- stack(rlist)      

# READ POINT SHAPEFILE TRAINING DATA
# (inshape must hold the layer name: the shapefile name without the .shp extension)
sdata <- readOGR(dsn=getwd(), layer=inshape)

# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
  sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])

# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)], y=as.factor(sdata@data[,"train"]),
                       ntree=501, importance=TRUE)

# CHECK ERROR CONVERGENCE
plot(rf.mdl)

# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
varImpPlot(rf.mdl, type=1)

# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response", 
        index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
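
Before trusting the predicted raster it is worth looking at how well the model separates the classes. The following is a minimal sketch using only randomForest and base R on a synthetic stand-in for the extracted training table; the class names, band names and the 30% holdout are assumptions for illustration, and this is not the validation code in rfUtilities.

```r
library(randomForest)

set.seed(11)
# Hypothetical stand-in for the training table produced by extract()
n <- 150
train <- data.frame(
  class = factor(rep(c("grass", "trees", "water"), each = n / 3)),
  b1 = rnorm(n), b2 = rnorm(n), b3 = rnorm(n), b4 = rnorm(n)
)
# Shift band 4 by class so the classes are separable (illustration only)
train$b4 <- train$b4 + 3 * as.integer(train$class)

# HOLD OUT 30% OF THE POINTS FOR AN INDEPENDENT CHECK
test.idx <- sample(n, size = round(0.3 * n))
rf <- randomForest(class ~ ., data = train[-test.idx, ], ntree = 501)

# OOB CONFUSION MATRIX (rows = observed, last column = class error)
print(rf$confusion)

# CONFUSION MATRIX AND OVERALL ACCURACY ON THE WITHHELD POINTS
pred <- predict(rf, newdata = train[test.idx, ])
conf <- table(observed = train$class[test.idx], predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

With real data you would replace the synthetic table with sdata@data and, ideally, use independently collected field points for the holdout.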

Related Solutions

[GIS] Computing unsupervised random forest classification in R

Random Forests in unlabeled (unsupervised) mode does not return explicit classes but rather something analogous to scaled multivariate distances, based on node proximities. Without the proximity matrix, you do not have a usable unlabeled model. And yes, for large problems, even using a sparse matrix, the very nature of the approach causes the proximity matrix to get huge. This may very well be the reason that you have not seen published approaches using Random Forests in unsupervised remote sensing.

One approach that I have seen for deriving and testing clusters from the proximities is to use a modified K-means on the proximity matrix. Alternatively, you may be able to trick the imputation function, using the random forests option, in the yaImpute package to perform a matrix imputation, which would return something analogous to a k nearest neighbor (kNN) result that could then be assigned to clusters based on a similarity matrix.

It is nowhere near as straightforward as you might think, and I would encourage you to research this approach before jumping in with both feet.
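
To make the proximity idea concrete, here is a small sketch on synthetic, unlabeled data. Note that it swaps in plain hierarchical clustering from base R (hclust) in place of the modified K-means mentioned above, and the two-group data and the choice of k = 2 are assumptions for illustration only.

```r
library(randomForest)

set.seed(3)
# Synthetic unlabeled data with two well separated groups (illustration only)
x <- data.frame(
  v1 = c(rnorm(50, mean = 0), rnorm(50, mean = 6)),
  v2 = c(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# Omitting y runs randomForest in unsupervised mode
rf.un <- randomForest(x = x, ntree = 501, proximity = TRUE)

# The n x n proximity matrix is the part that gets huge on rasters
print(dim(rf.un$proximity))

# Turn proximities into dissimilarities and cluster them
d <- as.dist(1 - rf.un$proximity)
hc <- hclust(d, method = "average")
k <- cutree(hc, k = 2)
print(table(k))
```

With 100 observations the proximity matrix has 10,000 entries; a modest raster with a million cells would need 10^12, which is why this does not scale to imagery without sampling.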

Edit 12/14/2018: A few versions ago I added an unsupervised random forests function to the rfUtilities package. I would not recommend it on large data such as rasters, but it is a useful clustering method. Here is a simple example.

library(rfUtilities)
library(sp)

data(meuse)
  meuse <- na.omit(meuse)

n = 6  
clust.meuse <- rf.unsupervised(meuse, n=n, proximity = TRUE, 
                               silhouettes = TRUE)
( meuse$k <- clust.meuse$k )

mds <- stats::cmdscale(clust.meuse$distances, eig=TRUE, k=n)
  colnames(mds$points) <- paste("Dim", 1:n)
  # ONE COLOR PER CLUSTER, INDEXED BY CLUSTER NUMBER
  mds.col <- rainbow(n)[clust.meuse$k]
plot(mds$points[,1:2],col=mds.col, pch=20)                 
pairs(mds$points, col=mds.col, pch=20)

coordinates(meuse) <- ~x+y
plot(meuse, col=mds.col, pch=19)
  box()

[GIS] Random Forest land cover classification in ArcMap

I am the main developer of MGET.

The first step in your problem is to obtain values of the covariates that you will use to fit the model to your 90 GPS points. It sounds like you want to use the 8 bands as your covariates. You need to add 8 fields to your shapefile (one for each band) and populate them using a tool such as Extract Multi Values to Points from recent versions of ArcGIS or Interpolate Raster Values at Points from MGET (equivalent to what Arc provides but developed before the Arc tool existed).

After that, you need to fit a classification model to the GPS points, using the field containing the known cover type as the response variable and the 8 band fields as the covariates (a.k.a. predictor variables). After that you can obtain some performance statistics for your model and then predict it on rasters representing the covariates.

You can see a basic overview of MGET's modeling workflow for this here. The example is somewhat dated--not all of the tool parameters will look exactly like what you see there--but the basic workflow is the same: fit the model to a table of data, predict it against the table to get some performance statistics, and predict it on a stack of rasters. In MGET, the procedure is the same regardless of which modeling framework you use--MGET currently provides GLM, GAM, trees (a.k.a. CARTs), and random forest--so you can try different kinds of models with very similar workflows.

I'm sorry I don't have detailed instructions about this workflow written up. So far, we have not had funding to develop a complete manual. All MGET tools have documentation within ArcGIS, so be sure you click the Show Help >> button on the tool dialogs if you have not done so already.

Regarding Jeffrey Evans' speculation that MGET does not utilize the R raster package: that is correct. The code in MGET that performs raster predictions was developed before the raster package was released to CRAN (R's distribution system for R packages), thus it does not rely on that package. But it is not correct that MGET will crash due to memory limitations. MGET's raster prediction code was written specifically to handle the situation you're facing, by performing predictions in blocks similar to how the raster package does it. Prior to the raster package being developed, MGET was one of the only readily-available tools that could handle prediction of large rasters. MGET users have done this, for example, using large bathymetry rasters with 5m resolution.
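
For readers curious what "performing predictions in blocks" means in practice, here is a rough sketch of the general idea in R. It is not MGET's code or the raster package's implementation: the function name predict_in_blocks, the block size and the synthetic cell table are all hypothetical, and a real tool would read each block from disk and write the result out instead of holding everything in memory.

```r
library(randomForest)

set.seed(5)
# Small synthetic training table with two classes (illustration only)
train <- data.frame(
  class = factor(rep(c("a", "b"), each = 50)),
  b1 = c(rnorm(50, mean = 0), rnorm(50, mean = 4)),
  b2 = rnorm(100)
)
# An odd ntree with two classes avoids tied votes
rf <- randomForest(class ~ ., data = train, ntree = 201)

# Stand-in for raster cell values: one row per cell, one column per band
cells <- data.frame(b1 = rnorm(10000, mean = 2, sd = 2), b2 = rnorm(10000))

# Hypothetical helper: predict a table of cells one block of rows at a time
predict_in_blocks <- function(model, newdata, block.size = 2500) {
  n <- nrow(newdata)
  out <- integer(n)
  for (s in seq(1, n, by = block.size)) {
    e <- min(s + block.size - 1, n)
    block <- newdata[s:e, , drop = FALSE]
    # Store the predicted class as its integer code, as in a class raster
    out[s:e] <- as.integer(predict(model, newdata = block))
  }
  out
}

blocked <- predict_in_blocks(rf, cells)
whole <- as.integer(predict(rf, newdata = cells))
print(identical(blocked, whole))
```

Only one block of cells needs to be in memory at a time, so peak memory use depends on the block size and not on the size of the raster.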

All of that said, if you believe you will be performing a lot of modeling as your career progresses, I encourage you to learn how to do it in R directly, and about modeling and statistics more generally, independent of software. In a sense, MGET's modeling tools are a "gateway drug" to R. MGET's tools are just as robust as R--they utilize R to perform the actual model fitting and prediction--but they expose only a limited subset of what is possible in R itself. As you continue to do more modeling projects, eventually you may face a situation in which MGET is not enough and you need the full flexibility of R.
