I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (e.g., QGIS, GRASS). Ideally you would have field data to train your classification.
The procedure for classification using Random Forests is fairly straightforward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack, assign the raster values to your training points using raster::extract, and then pass this to randomForest. You will need to coerce your "class" column into a factor for RF to treat the model as a classification instance. Once you have a fitted model you can use the predict function, passing it your raster stack. You will need to pass the standard arguments to predict in addition to ones specific to the raster predict method. The raster package can handle rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments in the raster predict function is "filename", which allows the resulting raster to be written to disk. For a multiclass problem you will need to set type="response" and index=1, which will output an integer raster of your classes.
There are a few caveats that should be noted:
- You cannot have more than 32 levels in your response variable (y) or in any factor on the right side of the equation (x).
- Your classes must be balanced. A 30% rule is a good one to follow: if you have more than 30% more observations in one class than in any other, your problem becomes imbalanced and the results can be biased.
- It is a misconception that RF cannot overfit. If you over-correlate your ensemble you can overfit the model. A good way to avoid this is to run a preliminary model and plot the error stabilization (see the sketch after this list). As a rule of thumb, I choose 2x the number of bootstraps required to stabilize the error for the ntree parameter, because variable interaction stabilizes at a slower rate than error. If you are not including many variables in the model you can be much more conservative with this parameter.
- Do not use node purity as a measure of variable importance. It is not permuted like the mean decrease in accuracy.
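To make the class-balance check and the error-stabilization rule concrete, here is a minimal, self-contained sketch on simulated data (my own addition; the class names and predictors are hypothetical, so substitute your own training table):
require(randomForest)
# Simulated training data: a class factor and three predictor "bands"
set.seed(42)
y <- factor(sample(c("forest", "grass", "water"), 300, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
x <- data.frame(b1=rnorm(300), b2=rnorm(300), b3=rnorm(300))
# Class balance: flag any class with more than 30% more observations than the smallest
cls.n <- table(y)
if (max(cls.n) > 1.3 * min(cls.n)) warning("classes look imbalanced")
# Preliminary model to inspect out-of-bag (OOB) error stabilization
pre.mdl <- randomForest(x=x, y=y, ntree=1000)
plot(pre.mdl$err.rate[, "OOB"], type="l", xlab="number of trees", ylab="OOB error")
# Choose ntree at roughly 2x the number of trees where the OOB error flattens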
I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.
Here is some simple code to get you started.
require(sp)
require(rgdal)
require(raster)
require(randomForest)
# CREATE LIST OF RASTERS
rlist <- list.files(getwd(), pattern="img$", full.names=TRUE)
# CREATE RASTER STACK
xvars <- stack(rlist)
# READ POINT SHAPEFILE TRAINING DATA
sdata <- readOGR(dsn=getwd(), layer=inshape)  # "inshape" holds your shapefile layer name
# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])
# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)], y=as.factor(sdata@data[,"train"]),
ntree=501, importance=TRUE)
# CHECK ERROR CONVERGENCE
plot(rf.mdl)
# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
varImpPlot(rf.mdl, type=1)
# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response",
index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
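If you want to follow up on the rfUtilities functions mentioned above, a hedged sketch of model selection and cross-validation on the fitted model might look like the following (function names and arguments are from memory and may differ between package versions, so check the package documentation):
require(rfUtilities)
# Variable/model selection on the same predictors and response (argument names assumed)
rf.sel <- rf.modelSel(xdata=sdata@data[,3:ncol(sdata@data)], ydata=as.factor(sdata@data[,"train"]), imp.scale="mir")
# Cross-validation of the fitted classification model (10% withheld, 99 replicates)
rf.cv <- rf.crossValidation(rf.mdl, xdata=sdata@data[,3:ncol(sdata@data)], p=0.10, n=99)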
I found a tutorial here
But it is not that helpful: when I get to the preparation of the reference data (join attributes by location), the result is a shapefile and there is no XML file, while the next step requires an XML file.
Still stuck.
In the end, I classified them based on a ruleset, as in eCognition, but I had to write the rules in the field calculator.
The XML file in the next step is for output ... the documentation says:
"Output XML file: XML filename where the statistics are saved for future reuse."
Sorry, this manual is for a different tool... it really seems that something is missing in that tutorial. I will try to use scikit-learn, since I now have a layer with segments and their features (in the 4th step of the segmentation I used as input the stack of layers that I want to use as classification features). I will report back here.
Best Answer
Classification algorithms such as Maximum Likelihood, random forests, and SVM are statistical methods for grouping data. These data may be words, colors, sounds, or anything you can imagine. In a remote sensing context, these algorithms are used to group pixels or image objects (segments) based on their statistical properties, or spectral profiles.
To answer the first part of your question, all three of these algorithms can be used to classify image objects (e.g. segments created in Matlab or eCognition). Since these image objects, or segments, are essentially created by drawing a line around statistically similar groups of pixels, the segments can in turn be classified into further classes (e.g. forest, grassland) if you define a set of rules or statistical properties that decides which objects are grouped together.
For the second part of the question, all three of these algorithms can also be used as pixel-based classifiers. The same principle holds for classifying pixels as for image objects or segments; the specific algorithm determines how the pixels are grouped together based on a given set of statistical rules.
From a software point of view, you can implement these classification algorithms at the pixel level or the image object level in software such as eCognition. You can also implement an object-based classification on image objects, or a pixel-based classification within image objects.
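For the object-based case described above, here is a minimal sketch in R (my own addition, continuing the style of the earlier answer): summarize each segment polygon by its mean band values and then classify the segments with random forests. The layer name "segs" and the training field "class" are hypothetical; substitute your own segment shapefile and attribute.
require(raster)
require(rgdal)
require(randomForest)
# Hypothetical segment polygons with a training field "class"
segs <- readOGR(dsn=getwd(), layer="segs")
# Spectral bands to use as classification features
bands <- stack(list.files(getwd(), pattern="img$", full.names=TRUE))
# Mean value of each band within each segment (the segment's spectral profile)
seg.stats <- extract(bands, segs, fun=mean, na.rm=TRUE, df=TRUE)
# Fit on segment-level statistics and predict a class for every segment
rf.obj <- randomForest(x=seg.stats[,-1], y=as.factor(segs@data$class), ntree=501)
segs@data$predicted <- predict(rf.obj, seg.stats[,-1])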