Random Forests in unlabeled (unsupervised) mode does not return explicit classes but rather something analogous to scaled multivariate distances, based on node proximities. Without the proximity matrix, you do not have a usable unsupervised model. And yes, for large problems the proximity matrix gets huge even when stored as a sparse matrix, because the approach by its very nature relates every observation to every other observation. This may very well be the reason that you have not seen published approaches using Random Forests in unsupervised remote sensing.
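To put a rough number on "huge", here is a back-of-the-envelope calculation. It assumes a hypothetical scene of 1000 x 1000 pixels and a dense proximity matrix stored as 8-byte doubles; both figures are illustrative assumptions, not values from the question.

```r
# Hypothetical scene size: 1000 x 1000 pixels (an assumption for illustration)
n_cells <- 1000 * 1000

# Assume each proximity value is stored as an 8-byte double
bytes_per_value <- 8

# Full n x n proximity matrix
dense_bytes <- n_cells^2 * bytes_per_value

# Storing only one triangle (the matrix is symmetric) roughly halves it
half_bytes <- n_cells * (n_cells - 1) / 2 * bytes_per_value

# Sizes in decimal terabytes
round(c(dense_TB = dense_bytes / 1e12, half_TB = half_bytes / 1e12), 2)
```

Even the half matrix is far beyond the memory of a typical workstation, which is why the approach does not scale to whole rasters without subsampling.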
Based on the proximities, one approach that I have seen for deriving and testing clusters is to use a modified K-means on the proximity matrix. Alternatively, you may be able to trick the imputation function in the yaImpute package, using its random forests option, into performing a matrix imputation that returns something analogous to k nearest neighbors (kNN), which could then be assigned to clusters based on a similarity matrix.
It is nowhere near as straightforward as you are thinking, and I would encourage you to research this approach before jumping in with both feet.
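As a minimal sketch of clustering on the proximities, the example below runs randomForest without a response (which puts it in unsupervised mode), converts the proximities to dissimilarities, and clusters them. Note that it swaps in PAM (k-medoids, from the cluster package) for the modified K-means mentioned above, because PAM accepts a dissimilarity matrix directly. The iris data and k = 3 are assumptions chosen only so the example is self-contained.

```r
library(randomForest)
library(cluster)

set.seed(1)

# Small tabular example; the iris measurements stand in for pixel values
x <- iris[, 1:4]

# Omitting the response runs randomForest in unsupervised mode;
# proximity = TRUE returns the n x n proximity matrix
rf <- randomForest(x, ntree = 500, proximity = TRUE)

# Turn proximities (1 = always in the same terminal node) into dissimilarities
d <- as.dist(1 - rf$proximity)

# PAM (k-medoids) on the dissimilarities; k = 3 is an assumption
cl <- pam(d, k = 3)

# Compare the clusters with the known species, for illustration only
table(cluster = cl$clustering, species = iris$Species)
```

On a raster you would normally run this on a sample of pixels, since the proximity matrix grows with the square of the number of observations.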
**** Edit 12/14/2018
A few versions ago I added an unsupervised random forests function to the rfUtilities package. I would not recommend it on large data such as rasters but it is a useful clustering method. Here is a simple example.
library(rfUtilities)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
n <- 6   # number of clusters
clust.meuse <- rf.unsupervised(meuse, n=n, proximity = TRUE,
                               silhouettes = TRUE)
( meuse$k <- clust.meuse$k )
mds <- stats::cmdscale(clust.meuse$distances, eig=TRUE, k=n)
colnames(mds$points) <- paste("Dim", 1:n)
mds.col <- rainbow(n)[clust.meuse$k]   # one color per cluster id
plot(mds$points[,1:2],col=mds.col, pch=20)
pairs(mds$points, col=mds.col, pch=20)
coordinates(meuse) <- ~x+y
plot(meuse, col=mds.col, pch=19)
box()
I am the main developer of MGET.
The first step in your problem is to obtain values of the covariates that you will use to fit the model to your 90 GPS points. It sounds like you want to use the 8 bands as your covariates. You need to add 8 fields to your shapefile (one for each band) and populate them using a tool such as Extract Multi Values to Points from recent versions of ArcGIS or Interpolate Raster Values at Points from MGET (equivalent to what Arc provides but developed before the Arc tool existed).
After that, you need to fit a classification model to the GPS points, using the field containing the known cover type as the response variable and the 8 band fields as the covariates (a.k.a. predictor variables). After that you can obtain some performance statistics for your model and then predict it on rasters representing the covariates.
You can see a basic overview of MGET's modeling workflow for this here. The example is somewhat dated--not all of the tool parameters will look exactly like what you see there--but the basic workflow is the same: fit the model to a table of data, predict it against the table to get some performance statistics, and predict it on a stack of rasters. In MGET, the procedure is the same regardless of which modeling framework you use--MGET currently provides GLM, GAM, trees (a.k.a. CARTs), and random forest--so you can try different kinds of models with very similar workflows.
I'm sorry I don't have detailed instructions about this workflow written up. So far, we have not had funding to develop a complete manual. All MGET tools have documentation within ArcGIS, so be sure you click the Show Help >> button on the tool dialogs if you have not done so already.
Regarding Jeffrey Evans' speculation that MGET does not utilize the R raster package: that is correct. The code in MGET that performs raster predictions was developed before the raster package was released to CRAN (R's distribution system for R packages), thus it does not rely on that package. But it is not correct that MGET will crash due to memory limitations. MGET's raster prediction code was written specifically to handle the situation you're facing, by performing predictions in blocks similar to how the raster package does it. Prior to the raster package being developed, MGET was one of the few readily available tools that could handle prediction of large rasters. MGET users have done this, for example, using large bathymetry rasters with 5m resolution.
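To illustrate the general idea of block-wise prediction, here is a minimal sketch written with the raster package's low-level read/write functions. This is not MGET's code; it only shows the technique. The simulated three-band stack, the band names, the cover classes and the temporary output file are all assumptions made so that the example runs on its own.

```r
library(raster)
library(randomForest)

set.seed(1)

# Simulated 3-band stack standing in for real imagery
r <- raster(nrows = 50, ncols = 50)
bands <- stack(lapply(1:3, function(i) setValues(r, runif(ncell(r)))))
names(bands) <- c("b1", "b2", "b3")

# Simulated training table with a made-up two-class response
train <- data.frame(b1 = runif(90), b2 = runif(90), b3 = runif(90))
train$cover <- factor(ifelse(train$b1 > 0.5, "water", "land"))
fit <- randomForest(cover ~ ., data = train)

# Empty output raster with the same geometry, opened for writing
out <- raster(bands)
out <- writeStart(out, filename = tempfile(fileext = ".grd"),
                  datatype = "INT1U", overwrite = TRUE)

# Read, predict and write one block of rows at a time
bs <- blockSize(bands)
for (i in seq_len(bs$n)) {
  v <- as.data.frame(getValues(bands, row = bs$row[i], nrows = bs$nrows[i]))
  p <- predict(fit, newdata = v)                     # factor of predicted classes
  out <- writeValues(out, as.integer(p), bs$row[i])  # store integer class codes
}
out <- writeStop(out)
```

Only one block of rows is held in memory at any time, so memory use depends on the block size rather than on the size of the whole raster.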
All of that said, if you believe you will be performing a lot of modeling as your career progresses, I encourage you to learn how to do it in R directly, and about modeling and statistics more generally, independent of software. In a sense, MGET's modeling tools are a "gateway drug" to R. MGET's tools are just as robust as R--they utilize R to perform the actual model fitting and prediction--but they expose only a limited subset of what is possible in R itself. As you continue to do more modeling projects, eventually you may face a situation in which MGET is not enough and you need the full flexibility of R.
Best Answer
I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (e.g., QGIS, GRASS). Ideally you would have field data to train your classification.
The procedure for classification using Random Forests is fairly straightforward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack, assign the raster values to your training points using raster::extract, and then pass this to randomForest. You will need to coerce your "class" column into a factor to have RF recognize the model as a classification instance. Once you have a fitted model you can use the predict function, passing it your raster stack. You will need to pass the standard arguments to predict in addition to the ones specific to the raster predict function. The raster package can process rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments of the raster predict function is "filename", which allows the output raster to be written to disk. For a multiclass problem you will need to set type="response" and index=1, which will output an integer raster of your classes.
There are a few caveats that should be noted:
I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.
Here is some simple code to get you started.
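A minimal sketch of the workflow described above, using a small simulated raster stack and simulated training points so that it is self-contained. The band names, class labels, class rule and output file are made up for illustration; with real data you would read your own imagery and training points from disk instead.

```r
library(raster)
library(randomForest)

set.seed(42)

# Simulated 4-band image standing in for the spectral data
r <- raster(nrows = 40, ncols = 40, xmn = 0, xmx = 40, ymn = 0, ymx = 40)
img <- stack(lapply(1:4, function(i) setValues(r, runif(ncell(r)))))
names(img) <- paste0("band", 1:4)

# Simulated training locations standing in for a point shapefile
pts <- data.frame(x = runif(90, 0, 40), y = runif(90, 0, 40))

# Assign the raster values to the training points
train <- as.data.frame(raster::extract(img, as.matrix(pts)))

# Made-up class labels; the response must be a factor for classification
train$class <- cut(train$band1, breaks = c(0, 1/3, 2/3, 1),
                   labels = c("water", "forest", "urban"),
                   include.lowest = TRUE)

# Fit the classification model
fit <- randomForest(class ~ ., data = train, ntree = 500)

# Predict onto the raster stack, writing the result to disk
pred <- predict(img, fit, type = "response", index = 1,
                filename = tempfile(fileext = ".grd"), overwrite = TRUE)
pred
```

The output is a single-layer raster of integer class codes, in the order of the factor levels of the class column.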