[GIS] Computing unsupervised random forest classification in R

classification | machine learning | r | random forest

I want to compute an unsupervised random forest classification from a raster stack in R. The stack covers the same extent in several spectral bands, and I want an unsupervised classification of the whole stack.
I am running into problems because my dataset is very large.

Is it okay to simply convert the stack into a data frame in order to run the random forest algorithm, like this?

stack_median <- stack(b1_mosaic_median, b2_mosaic_median, b3_mosaic_median, b4_mosaic_median, b5_mosaic_median, b7_mosaic_median)
stack_median_df <- as.data.frame(stack_median)

Here is the data as a CSV file (https://www.dropbox.com/s/gkaryusnet46f0i/stack_median_df.csv?dl=0), which you can read in via:

stack_median_df<-read.csv(file="stack_median_df.csv")
stack_median_df<-stack_median_df[,-1]
stack_median_df_na <- na.omit(stack_median_df)

My next step would be the unsupervised classification:

# omitting the response (y) runs randomForest in unsupervised mode;
# "type" and "forest" are not valid arguments to randomForest()
median_rf <- randomForest(x = stack_median_df_na, importance = TRUE,
                          proximity = FALSE, ntree = 500)

Because my dataset is so large, a proximity matrix can't be calculated (it would need around 6000 GB).

Do you know how I can have a look at the classification? predict(median_rf) and plot(median_rf) don't return anything useful.

Best Answer

Random Forests in unlabeled (unsupervised) mode does not return explicit classes but rather something analogous to scaled multivariate distances, based on node proximities. Without the proximity matrix, you do not have a usable unsupervised model. And yes, for large problems, even using a sparse matrix, the very nature of the approach causes the proximity matrix to become huge. This may well be why you have not seen published approaches using Random Forests for unsupervised remote sensing classification.
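To see why, note that a dense proximity matrix is n × n doubles, so its memory footprint grows quadratically with the number of pixels. A quick back-of-the-envelope check (a sketch; the pixel counts below are illustrative, though ~9×10^5 non-NA pixels matches the ~6000 GB figure in the question):

```r
# Approximate memory (GB) needed for a dense n x n proximity matrix of doubles
prox_gb <- function(n) n^2 * 8 / 1024^3

prox_gb(1e4)   # ~0.75 GB for 10,000 pixels
prox_gb(9e5)   # ~6000 GB for ~900,000 pixels
```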

Based on the proximities, one approach I have seen for deriving/testing clusters is to run a modified k-means on the proximity matrix. Alternatively, you may be able to use the random forest option of the imputation function in the yaImpute package to perform a matrix imputation, which returns something analogous to a k nearest neighbor (kNN) result that could then be assigned to clusters based on a similarity matrix.
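As a minimal sketch of that proximity-then-cluster idea, on a small subsample so the proximity matrix stays tractable (this swaps in hierarchical clustering for the modified k-means mentioned above; the sample size and k are illustrative, not recommendations):

```r
library(randomForest)

set.seed(42)
dat <- na.omit(iris[, 1:4])           # stand-in for your band data frame
sub <- dat[sample(nrow(dat), 100), ]  # subsample: proximity is n x n, keep n small

# Omitting y runs randomForest in unsupervised mode; proximity = TRUE is the point here
urf <- randomForest(sub, ntree = 500, proximity = TRUE)

# Proximities are similarities in [0, 1]; 1 - proximity behaves like a distance
d  <- as.dist(1 - urf$proximity)
cl <- cutree(hclust(d, method = "ward.D2"), k = 3)
table(cl)
```

For a full raster you would cluster a subsample this way and then assign the remaining pixels to the nearest cluster, since the full proximity matrix cannot be materialized.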

It is nothing near as straightforward as what you are thinking and I would encourage you to research this approach before jumping in with both feet.

Edit 12/14/2018: A few versions ago I added an unsupervised random forests function to the rfUtilities package. I would not recommend it for large data such as rasters, but it is a useful clustering method. Here is a simple example.

library(rfUtilities)
library(sp)

data(meuse)
meuse <- na.omit(meuse)

n <- 6
clust.meuse <- rf.unsupervised(meuse, n = n, proximity = TRUE,
                               silhouettes = TRUE)
(meuse$k <- clust.meuse$k)

mds <- stats::cmdscale(clust.meuse$distances, eig = TRUE, k = n)
colnames(mds$points) <- paste("Dim", 1:n)
mds.col <- rainbow(6)[clust.meuse$k]   # one color per cluster
plot(mds$points[, 1:2], col = mds.col, pch = 20)
pairs(mds$points, col = mds.col, pch = 20)

coordinates(meuse) <- ~x+y
plot(meuse, col = mds.col, pch = 19)
box()
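To connect this back to the raster question: once you have a cluster label for each non-NA pixel, you can put the labels back on the grid, because na.omit() preserves the original cell numbers in the row names. A self-contained sketch with a toy raster (the toy layer and random labels here are stand-ins for your mosaic stack and actual cluster assignments):

```r
library(raster)

# Toy single-band raster with a few NA cells, standing in for the mosaic stack
r <- raster(nrows = 10, ncols = 10, vals = runif(100))
r[c(5, 50, 95)] <- NA

df    <- as.data.frame(r)
df_na <- na.omit(df)                  # row names keep the original cell numbers

# Stand-in for cluster labels from an unsupervised classification
labels <- sample(1:3, nrow(df_na), replace = TRUE)

out <- raster(r)                      # empty raster with the same geometry
out[as.numeric(rownames(df_na))] <- labels
plot(out)                             # map of the cluster assignments
```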