[GIS] Computing unsupervised random forest classification in R

classification | machine learning | r | random forest

I want to compute an unsupervised random forest classification from a raster stack in R. The stack covers the same extent in several spectral bands, and I want an unsupervised classification of the whole stack.
I am running into problems because my dataset is very large.

Is it okay to simply convert the stack into a data frame in order to run the random forest algorithm, like this?

stack_median <- stack(b1_mosaic_median, b2_mosaic_median, b3_mosaic_median, b4_mosaic_median, b5_mosaic_median, b7_mosaic_median)
stack_median_df <- as.data.frame(stack_median)

Here is the data as a CSV file (https://www.dropbox.com/s/gkaryusnet46f0i/stack_median_df.csv?dl=0), which you can read in via:

stack_median_df<-read.csv(file="stack_median_df.csv")
stack_median_df<-stack_median_df[,-1]
stack_median_df_na <- na.omit(stack_median_df)

My next step would be the unsupervised classification:

# omitting the response (y) runs randomForest in unsupervised mode;
# "type" and "forest" are not valid arguments to randomForest()
median_rf <- randomForest(x = stack_median_df_na, importance = TRUE,
                          proximity = FALSE, ntree = 500)

Because my dataset is so large, a proximity matrix can't be calculated (it would need around 6000 GB).

Do you know how I can have a look at the classification? predict(median_rf) and plot(median_rf) don't return anything useful.

Best Answer

Random Forests in unlabeled (unsupervised) mode does not return explicit classes but rather something analogous to scaled multivariate distances, based on node proximities. Without the proximity matrix, you do not have a usable unsupervised model. And yes, for large problems, even using a sparse matrix, the very nature of the approach causes the proximity matrix to become huge. This may well be why you have not seen published approaches using Random Forests for unsupervised remote sensing classification.
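To see why, note that a dense proximity matrix is n × n doubles, so its memory footprint grows quadratically with the number of pixels. A quick back-of-the-envelope check (a sketch; the pixel counts below are illustrative, though ~9×10^5 non-NA pixels matches the ~6000 GB figure in the question):

```r
# Approximate memory (GB) needed for a dense n x n proximity matrix of doubles
prox_gb <- function(n) n^2 * 8 / 1024^3

prox_gb(1e4)   # ~0.75 GB for 10,000 pixels
prox_gb(9e5)   # ~6000 GB for ~900,000 pixels
```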

Based on the proximities, one approach I have seen for deriving/testing clusters is to run a modified k-means on the proximity matrix. Alternatively, you may be able to use the random forest option of the imputation function in the yaImpute package to perform a matrix imputation, which returns something analogous to a k nearest neighbor (kNN) result that could then be assigned to clusters based on a similarity matrix.
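As a minimal sketch of that proximity-then-cluster idea, on a small subsample so the proximity matrix stays tractable (this swaps in hierarchical clustering for the modified k-means mentioned above; the sample size and k are illustrative, not recommendations):

```r
library(randomForest)

set.seed(42)
dat <- na.omit(iris[, 1:4])           # stand-in for your band data frame
sub <- dat[sample(nrow(dat), 100), ]  # subsample: proximity is n x n, keep n small

# Omitting y runs randomForest in unsupervised mode; proximity = TRUE is the point here
urf <- randomForest(sub, ntree = 500, proximity = TRUE)

# Proximities are similarities in [0, 1]; 1 - proximity behaves like a distance
d  <- as.dist(1 - urf$proximity)
cl <- cutree(hclust(d, method = "ward.D2"), k = 3)
table(cl)
```

For a full raster you would cluster a subsample this way and then assign the remaining pixels to the nearest cluster, since the full proximity matrix cannot be materialized.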

It is nothing near as straightforward as what you are thinking and I would encourage you to research this approach before jumping in with both feet.

Edit 12/14/2018: A few versions ago I added an unsupervised random forests function to the rfUtilities package. I would not recommend it for large data such as rasters, but it is a useful clustering method. Here is a simple example.

library(rfUtilities)
library(sp)

data(meuse)
meuse <- na.omit(meuse)

n <- 6
clust.meuse <- rf.unsupervised(meuse, n = n, proximity = TRUE,
                               silhouettes = TRUE)
(meuse$k <- clust.meuse$k)

mds <- stats::cmdscale(clust.meuse$distances, eig = TRUE, k = n)
colnames(mds$points) <- paste("Dim", 1:n)
mds.col <- rainbow(6)[clust.meuse$k]   # one color per cluster
plot(mds$points[, 1:2], col = mds.col, pch = 20)
pairs(mds$points, col = mds.col, pch = 20)

coordinates(meuse) <- ~x+y
plot(meuse, col = mds.col, pch = 19)
box()
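To connect this back to the raster question: once you have a cluster label for each non-NA pixel, you can put the labels back on the grid, because na.omit() preserves the original cell numbers in the row names. A self-contained sketch with a toy raster (the toy layer and random labels here are stand-ins for your mosaic stack and actual cluster assignments):

```r
library(raster)

# Toy single-band raster with a few NA cells, standing in for the mosaic stack
r <- raster(nrows = 10, ncols = 10, vals = runif(100))
r[c(5, 50, 95)] <- NA

df    <- as.data.frame(r)
df_na <- na.omit(df)                  # row names keep the original cell numbers

# Stand-in for cluster labels from an unsupervised classification
labels <- sample(1:3, nrow(df_na), replace = TRUE)

out <- raster(r)                      # empty raster with the same geometry
out[as.numeric(rownames(df_na))] <- labels
plot(out)                             # map of the cluster assignments
```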