I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (e.g., QGIS, GRASS). Ideally you would have field data to train your classification.
The procedure for classification using Random Forests is fairly straightforward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack, assign the raster values to your training points using raster::extract, and then pass this to randomForest. You will need to coerce your "class" column into a factor so that RF recognizes the model as a classification instance. Once you have a fit model you can use the predict function, passing it your raster stack. You will need to pass the standard arguments to predict in addition to ones specific to the raster predict function. The raster package can process rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments in the raster predict function is "filename", allowing a raster to be written to disk. For a multiclass problem you will need to set type="response" and index=1, which will output an integer raster of your classes.
There are a few caveats that should be noted:
- You cannot have more than 32 levels in your response variable (y) or in any factor on the right side of the equation (x).
- Your classes must be balanced. A 30% rule is a good one to follow: if you have more than 30% more observations in one class than in any other, your problem becomes imbalanced and the results can be biased.
- It is a misconception that RF cannot overfit. If you over-correlate your ensemble you can overfit the model. A good way to avoid this is to run a preliminary model and plot the error stabilization (a short sketch follows this list). As a rule of thumb, I choose 2X the number of bootstraps required to stabilize the error for the ntree parameter, because variable interaction stabilizes at a slower rate than error. If you are not including many variables in the model you can be much more conservative with this parameter.
- Do not use node purity as a measure of variable importance. It is not permuted like the mean decrease in accuracy.
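To illustrate the error-stabilization check mentioned in the list above, here is a minimal, self-contained sketch using the built-in iris data; the 0.001 tolerance is an arbitrary way to operationalize "stabilized" and is not part of the original advice.
require(randomForest)
# FIT A PRELIMINARY MODEL WITH A GENEROUS NUMBER OF TREES (TOY DATA FOR ILLUSTRATION)
rf.prelim <- randomForest(x=iris[,1:4], y=iris$Species, ntree=1000)
# PLOT OOB AND PER-CLASS ERROR AGAINST NUMBER OF TREES
plot(rf.prelim)
# OUT-OF-BAG ERROR FOR EACH ENSEMBLE SIZE
oob <- rf.prelim$err.rate[, "OOB"]
# ONE WAY TO FLAG STABILIZATION: LAST TREE COUNT AT WHICH THE OOB ERROR STILL CHANGED BY MORE THAN 0.001
stable.at <- max(which(abs(diff(oob)) > 0.001)) + 1
# RULE OF THUMB FROM ABOVE: USE ROUGHLY TWICE THAT FOR ntree
ntree.use <- 2 * stable.at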
I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.
Here is some simple code to get you started.
require(sp)
require(rgdal)
require(raster)
require(randomForest)
# CREATE LIST OF RASTERS
rlist <- list.files(getwd(), pattern="img$", full.names=TRUE)
# CREATE RASTER STACK
xvars <- stack(rlist)
# READ POINT SHAPEFILE TRAINING DATA
sdata <- readOGR(dsn=getwd(), layer="inshape")  # "inshape" is the name of your point shapefile (without .shp)
# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])
# RUN RF MODEL ("train" is the column holding the class labels; columns 3 onward hold the extracted band values)
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)], y=as.factor(sdata@data[,"train"]),
                       ntree=501, importance=TRUE)
# CHECK ERROR CONVERGENCE
plot(rf.mdl)
# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
varImpPlot(rf.mdl, type=1)
# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response",
index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
Stay in raster format.
If ArcGIS provided a zonal percentile function you could do this in a couple of easy steps. Since it doesn't, you have to work around that limitation. There are many ways to proceed. The key ideas are:
Determine the proportion of cells in each class represented by the desired sample of 300.
By thresholding a grid of uniform random values, you can identify slightly more than the number of points needed in each class.
Extract these points, then post-process them by retaining only those having the lowest random values within each class.
This procedure obtains a simple random sample without replacement, independently within each land class.
Here are some details. Let there be N classes (N = 3 in the question), each with n(i) cells (i = 1, 2, ..., N). Suppose you intend to select k(i) cells randomly from each class (k(i) = 300 for the class indexes i = 1, 2, 3 in the question). You will do that by generating a uniform random grid and comparing it to a threshold grid [T], which has the constant value p(i) at each cell in class i, with p(i) somewhere (to be chosen appropriately) between 0 and 1.
The result of this comparison in class i is a random count X(i) having a Binomial(n(i), p(i)) distribution. Its expectation therefore is m(i) = n(i)*p(i), which needs to be close to k(i). However, X(i) can randomly fall below that. Its variance is v(i) = n(i)*p(i)*(1-p(i)). To be really sure that X(i) will be at least k(i), we need a very small lower percentile of the distribution of X(i) to exceed k(i). You can work out the resulting value of p(i) with a statistical calculator, but an effective rule of thumb (for moderately small numbers of classes N) is to make k(i) several standard deviations smaller than m(i). Letting a reasonable starting value for "several" be L, we obtain the quadratic inequality
k(i) <= m(i) - L*sqrt(n(i)*p(i)*(1-p(i)))
whose approximate solution is
p(i) = (k(i)/n(i)) * (1 + z(i) + z(i)^2/2)
where
z(i) = L/sqrt(k(i)).
In the question, k(i) = 300 and it looks like the n(i) are in the millions. Taking L = 3, say, gives z(i) = 3/sqrt(300), which is about 1/6 (and z(i)^2/2 is small enough to neglect). This tells us to ask for about 1/6 more points than the desired proportion k(i)/n(i).
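As a quick numeric check of that rule of thumb in R (only k = 300 and L = 3 come from the question; the class size n below is an arbitrary illustrative value):
k <- 300; L <- 3
n <- 2e6                            # example class size in cells; substitute your own n(i)
z <- L / sqrt(k)                    # about 1/6
p <- (k / n) * (1 + z + z^2/2)      # inflated inclusion probability p(i) for this class
p * n                               # expected count m(i), comfortably above k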
With this (simple) calculation out of the way,
Reclassify the land classes i into the values of the p(i) you computed. Call the resulting grid [P].
Create a uniform random grid [R].
Set the values of [R] to null wherever [R] > [P].
This last grid will have slightly more than X = k(1) + k(2) + ... + k(N) (= 900 in the question) cells that are not null. At this juncture use your favorite method (there are many that will work) to
combine it with the land class grid and
output all tuples (i, R(i), x coordinate, y coordinate) as a point shapefile.
Post-process the shapefile by retaining only the lowest k(i) values for each class i: that's the desired sample. If you are so extremely unlucky as to not obtain enough points in some class, just repeat the whole process from the beginning. (Consider increasing the value of L when you do so.)
Alternatively, doing the work in R would be simple, because it offers functions to find the lowest k(i) random values in each class directly, rather than having to take several steps and convert things into a point shapefile format. The same conceptual workflow is involved.
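For instance, a minimal sketch of that R approach, assuming the raster and sp packages, a categorical land-class raster called lc, and k = 300 points per class (the object names are placeholders):
require(raster)
require(sp)
set.seed(42)                                        # reproducible sample
v <- values(lc)                                     # land class of every cell (NA outside the study area)
r <- runif(length(v))                               # one uniform random value per cell
k <- 300
sel <- unlist(lapply(sort(unique(na.omit(v))), function(cl) {
  idx <- which(v == cl)                             # cells belonging to class cl
  idx[order(r[idx])[seq_len(min(k, length(idx)))]]  # keep the k lowest random values
}))
pts <- xyFromCell(lc, sel)                          # coordinates of the sampled cells
samp <- SpatialPointsDataFrame(pts, data.frame(class = v[sel]))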
I am the main developer of MGET.
The first step in your problem is to obtain values of the covariates that you will use to fit the model to your 90 GPS points. It sounds like you want to use the 8 bands as your covariates. You need to add 8 fields to your shapefile (one for each band) and populate them using a tool such as Extract Multi Values to Points from recent versions of ArcGIS or Interpolate Raster Values at Points from MGET (equivalent to what Arc provides but developed before the Arc tool existed).
After that, you need to fit a classification model to the GPS points, using the field containing the known cover type as the response variable and the 8 band fields as the covariates (a.k.a. predictor variables). After that you can obtain some performance statistics for your model and then predict it on rasters representing the covariates.
You can see a basic overview of MGET's modeling workflow for this here. The example is somewhat dated--not all of the tool parameters will look exactly like what you see there--but the basic workflow is the same: fit the model to a table of data, predict it against the table to get some performance statistics, and predict it on a stack of rasters. In MGET, the procedure is the same regardless of which modeling framework you use--MGET currently provides GLM, GAM, trees (a.k.a. CARTs), and random forest--so you can try different kinds of models with very similar workflows.
I'm sorry I don't have detailed instructions about this workflow written up. So far, we have not had funding to develop a complete manual. All MGET tools have documentation within ArcGIS, so be sure you click the Show Help >> button on the tool dialogs if you have not done so already.
Regarding Jeffrey Evans' speculation that MGET does not utilize the R raster package: that is correct. The code in MGET that performs raster predictions was developed before the raster package was released to CRAN (R's distribution system for R packages), and thus it does not rely on that package. But it is not correct that MGET will crash due to memory limitations. MGET's raster prediction code was written specifically to handle the situation you're facing, by performing predictions in blocks, similar to how the raster package does it. Prior to the raster package being developed, MGET was one of the only readily available tools that could handle prediction of large rasters. MGET users have done this, for example, using large bathymetry rasters with 5 m resolution.
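For readers curious what block-wise prediction looks like under the hood, here is a rough illustrative sketch in R using the raster package; this is not MGET's code, xvars and rf.mdl are the raster stack and fitted model from the first answer, and the output filename is arbitrary.
require(raster)
require(randomForest)
out <- raster(xvars)                                 # empty raster with the same geometry as the stack
bs  <- blockSize(xvars)                              # how the rows will be chunked into blocks
out <- writeStart(out, "RfClassPred_blocks.img", overwrite=TRUE)
for (i in seq_len(bs$n)) {
  v  <- as.data.frame(getValues(xvars, row=bs$row[i], nrows=bs$nrows[i]))
  p  <- rep(NA_integer_, nrow(v))                    # cells with missing band values stay NA
  ok <- complete.cases(v)                            # predict only where all bands are present
  if (any(ok)) p[ok] <- as.integer(predict(rf.mdl, v[ok, , drop=FALSE]))
  out <- writeValues(out, p, bs$row[i])
}
out <- writeStop(out)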
All of that said, if you believe you will be performing a lot of modeling as your career progresses, I encourage you to learn how to do it in R directly, and about modeling and statistics more generally, independent of software. In a sense, MGET's modeling tools are a "gateway drug" to R. MGET's tools are just as robust as R--they utilize R to perform the actual model fitting and prediction--but they expose only a limited subset of what is possible in R itself. As you continue to do more modeling projects, eventually you may face a situation in which MGET is not enough and you need the full flexibility of R.