[GIS] Rf-Classification seems to work but gives an error 15 seconds later

classification, imagery, random forest

This is my code.
It classifies an imagery stack (xvars) using training points read from a shapefile.

If I run the lines one at a time, they work. If I run them all at once, they also work and the prediction starts, but after about 15 seconds it stops with an error.

As far as I understand, the classification doesn't need the raster names because they are read automatically.

My training shapefile (sw_trainshape.shp) has an attribute table with these columns:

FID    Shape    Class      ObjectID       x              y
 1     Point     bush         1      481791,2429   5626286,6397
 2      ...       ...        ...            ...

My tif files are named band1, band2, band3 and so on. I have 6 bands.

I'm not very experienced yet, so I could use some help understanding why my code doesn't classify.

ERROR:

"Loading required package: tcltk

Error in predict.randomForest(model, blockvals, …) :
variables in the training data missing in newdata"

Code:

setwd("D:/BA-Workspace/DOP_10/orthophotos_abcd/R/test_run_R/test_other")


library(sp)
library(rgdal)
library(raster)
library(randomForest)

# create list of rasters
rlist=list.files(getwd(), pattern="tif$", full.names=TRUE) 

# CREATE RASTER STACK
xvars <- stack(rlist)      

# READ Raster TRAINING DATA
sdata <- readOGR(dsn=getwd(), layer="sw_trainshape")

# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])

# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)],     y=as.factor(sdata@data[,"Class"]),
                   ntree=501, importance=TRUE)

# CHECK ERROR CONVERGENCE
#plot(rf.mdl)

# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
#varImpPlot(rf.mdl, type=1)

# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response", 
    index=1, na.rm=TRUE, progress="window", overwrite=TRUE)

added sdata@data

Console:

> setwd("D:/BA-Workspace/DOP_10/orthophotos_abcd/R/test_run_R/test_other")
> 
> 
> library(sp)
> library(rgdal)
> library(raster)
> library(randomForest)
> 
> 
> # CREATE LIST OF RASTERS
> rlist=list.files(getwd(), pattern="tif$", full.names=TRUE) 
> 
> # CREATE RASTER STACK
> xvars <- stack(rlist)      
> 
> # READ Raster TRAINING DATA
> sdata <- readOGR(dsn=getwd(), layer="sw_trainshape")
OGR data source with driver: ESRI Shapefile 
Source: "D:/BA-Workspace/DOP_10/orthophotos_abcd/R/test_run_R/test_other", layer:     "sw_trainshape"
with 256 features and 10 fields
Feature type: wkbPoint with 2 dimensions
> 
> # ASSIGN RASTER VALUES TO TRAINING DATA
> v <- as.data.frame(extract(xvars, sdata))
> sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])
> 
> # RUN RF MODEL
> rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)],     y=as.factor(sdata@data[,"Class"]),
+                        ntree=501, importance=TRUE)
> 
> # CHECK ERROR CONVERGENCE
> #plot(rf.mdl)
> 
> # PLOT mean decrease in accuracy VARIABLE IMPORTANCE
> #varImpPlot(rf.mdl, type=1)
> #setOldClass(SpatialPointsDataFrame)
> # PREDICT MODEL
> predict(xvars, rf.mdl, filename="RfClassPred.img", type="response", 
+         index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
Error in predict.randomForest(model, blockvals, ...) : 
  variables in the training data missing in newdata

Solution, thanks to TimSalabim:

setwd("D:/BA-Workspace/sw_west_aug/reduced_size/")


library(sp)
library(rgdal)
library(raster)
library(randomForest)


# CREATE LIST OF RASTERS
rlist=list.files(getwd(), pattern="tif$", full.names=TRUE) 

# CREATE RASTER STACK
xvars <- stack(rlist)

# BUILD X AND Y COORDINATE LAYERS
x <- coordinates(xvars)[, 1]
y <- coordinates(xvars)[, 2]

x_rst <- y_rst <- xvars[[1]]
x_rst[] <- x
y_rst[] <- y

xvars <- stack(x_rst, y_rst, xvars)
names(xvars) <- c("X", "Y", "focal_1", "focal_2", "focal_3")

# READ TRAINING DATA
sdata <- readOGR(dsn=getwd(), layer="training_west")

# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])

sdata@data  <- sdata@data[-c(5,6)] 

# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)],   y=as.factor(sdata@data[,"class"]),
                   ntree=501, importance=TRUE)

# CHECK ERROR CONVERGENCE
#plot(rf.mdl)
#sdata@data 

# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
#varImpPlot(rf.mdl, type=1)
#setOldClass(SpatialPointsDataFrame)
# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response", 
    index=1, na.rm=TRUE, progress="window", overwrite=TRUE)

Best Answer

You need to make sure that names(sdata@data[,3:ncol(sdata@data)]) and names(xvars) are exactly the same. Check this using

identical(names(sdata@data[,3:ncol(sdata@data)]), names(xvars))

If TRUE, your predict should run fine.
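If the check returns FALSE, setdiff() shows which names differ. A minimal sketch in base R; the two name vectors below are made up for illustration and only stand in for names(xvars) and the names of the training columns:

```r
# Hypothetical names, standing in for names(xvars) and
# names(sdata@data[, 3:ncol(sdata@data)]); not taken from the real data.
stack_names <- c("band1", "band2", "band3")
train_names <- c("ObjectID", "x", "y", "band1", "band2", "band3")

# TRUE only if both vectors hold the same names in the same order
same <- identical(train_names, stack_names)

# Predictors the model would be trained on that the stack does not provide;
# these are the ones behind "variables in the training data missing in newdata"
missing_in_stack <- setdiff(train_names, stack_names)

# Layers in the stack that the model would never see
unused_in_model <- setdiff(stack_names, train_names)

print(same)
print(missing_in_stack)
print(unused_in_model)
```

Anything listed in missing_in_stack either has to be dropped from the training columns or added to the stack as a layer.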

The warnings/errors related to your edit are irrelevant; they come from trying to display a SpatialPointsDataFrame (an S4 class object) as a standard data.frame in RStudio.

EDIT: It seems you have differences between your stack layer names and your sdata@data data frame. Make sure these are the same. If you would like to include x and y coordinates as layers in your stack (whether this makes sense obviously depends on your objective), you could do it like this:

x <- coordinates(xvars)[, 1]
y <- coordinates(xvars)[, 2]

x_rst <- y_rst <- xvars[[1]]
x_rst[] <- x
y_rst[] <- y

Then you would need to add those to your stack at the appropriate position:

xvars <- stack(x_rst, y_rst, xvars)

Note also that you have additional variables in your sdata@data data frame ("band1.1" etc.). I don't know where these come from; maybe you are merging something earlier? Again, for predict() to work properly, the layer names of your stack and the column names of your training data need to be identical.
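One possible source of names like "band1.1" (an assumption, since the original data is not shown): data.frame() makes column names unique by appending a numeric suffix, so combining the attribute table with the extracted values a second time would produce them. A small sketch with made-up values:

```r
# Hypothetical stand-ins for sdata@data (already holding a band1 column)
# and for the values extracted from the stack
attrs <- data.frame(Class = c("bush", "tree"), band1 = c(10, 20))
v     <- data.frame(band1 = c(10, 20))

# data.frame() de-duplicates column names, so the second "band1"
# is renamed with a numeric suffix
combined <- data.frame(attrs, v)
print(names(combined))
```

If this is the cause, re-reading the shapefile before extracting values again avoids the duplicated columns.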

Related Solutions

[GIS] The stability of randomForest models after increasing predictor variables

One of the cool things about random forests is that they probe a random subset of the variables at each node. The variable providing the split with the best entropy (or other criterion) is kept, while the others are discarded and possibly tested at a subsequent or different node. In very simple words, a variable that does not provide any information about the split (e.g. canopy vs. non-canopy) will not be used in the final model. Conversely, a variable can become informative after a split on another variable, and may then be useful at prediction time.

In principle, add all the information you can (ideally variables that you know beforehand are related to the outputs of your problem). Landsat bands and images surely correlate well with the presence of canopy, so add them.

Personally, I use RF classifiers with thousands of variables. In these situations you only have to make sure that you employ many trees (this prevents overfitting; in principle, the more you use the better) and that at each node you test roughly sqrt(#variables) variables in order to explore your feature space properly. Regarding the fraction of training examples to use in bagging, I never observed significant differences in test accuracy. Training time can improve, though.
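As a point of reference, the randomForest package's default mtry for classification is floor(sqrt(p)), where p is the number of predictors. The predictor counts below are arbitrary examples:

```r
# Arbitrary numbers of predictor variables, for illustration only
p <- c(6, 100, 2500)

# Number of variables tried at each split under the sqrt rule
mtry_sqrt <- floor(sqrt(p))

print(data.frame(predictors = p, mtry = mtry_sqrt))
```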

Regarding your second point: yes, all the variables that you use at training time have to be present at test (prediction) time, in the same order and with the same scaling. (Note that the scaling of a variable is not itself important for random forests, but the distribution of training and test data must obviously be the same!)
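A minimal base R sketch of lining up prediction data with the training variables by name. The data frames and column names are hypothetical:

```r
# Hypothetical training predictors, and new data with an extra column
# and a different column order
train_x <- data.frame(band1 = c(1, 2), band2 = c(3, 4), band3 = c(5, 6))
newdata <- data.frame(band3 = 9, id = 1, band1 = 7, band2 = 8)

# Stop early with a readable message if a training variable is absent
missing_vars <- setdiff(names(train_x), names(newdata))
if (length(missing_vars) > 0) {
  stop("missing in newdata: ", paste(missing_vars, collapse = ", "))
}

# Keep only the training variables, in the training order
newdata_aligned <- newdata[, names(train_x), drop = FALSE]
print(names(newdata_aligned))
```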

Regarding literature, the best thing I ever read about RF is this ("THE") RF tutorial. It is biased towards computer vision applications (in particular the Kinect body part recognition system they developed) but is a very easy and nice read. After that, you should understand the whole RF thing. For remote sensing applications I don't know; maybe just browse academic journals. A proper search with the right keywords will surely give nice references.

EDIT: you will probably be interested only in the classification part of the above tutorial, but I suggest reading it entirely; it's very nice.

[GIS] Generating prediction raster from Random Forest model using R

Just a quick note on this "problem". When you read in a raster, be it a single raster or in a stack/brick, the default names are the names of the on-disk files. When using the raster::predict function, the names in the model object must match the names in the stack/brick. As such, it is good convention to assign the names that you want to use across your modeling workflow. This also provides an additional advantage in easing data management.

Let's say you have a naming convention in your raster layers that correspond to your covariates. You can define a vector of covariate names and then use the vector to read the data with very efficient code.

Dummy covariate/raster names:

covariates <- paste0(rep("v", 10), 1:10)

Create a vector of rasters (tif) in specified directory. If different from your working directory you can use the full.names = TRUE argument in list.files.

rlist <- list.files(getwd(), "tif$")

Then you can use grep to query the vector of rasters to match your covariate names, and since you already have a vector of names you can then assign it to the stack object. The grep function returns an index, thus the brackets, of the query. Using paste with collapse allows you to pass multiple values to grep, based on the covariates vector.

vars <- stack(rlist[grep(paste(covariates, collapse = "|"), rlist)])
names(vars) <- covariates
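One caveat with the grep approach: it returns files in the order of rlist, which list.files() sorts alphabetically (so v10 comes before v2), and that need not be the order of the covariates vector. A hedged alternative, assuming the file names minus their extension equal the covariate names, is to match them exactly. The file names below are hypothetical:

```r
# Hypothetical file listing, in the alphabetical order list.files() returns
rlist <- c("v1.tif", "v10.tif", "v2.tif", "slope.tif")
covariates <- c("v1", "v2", "v10")

# Layer names as they appear on disk, without the extension
file_names <- tools::file_path_sans_ext(basename(rlist))

# Exact matching, returned in the order of `covariates`
idx <- match(covariates, file_names)
stopifnot(!anyNA(idx))

# Files to pass to stack(), now in the same order as `covariates`
selected <- rlist[idx]
print(selected)
```

With the files selected this way, assigning covariates as the stack names keeps each name attached to the layer it describes.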

Now the names issue is solved for the raster::predict function. We should also address calling the function itself. It is important to keep in mind that raster::predict is a wrapper for other predict functions that each have their own data structures. The example at hand is the predict method randomForest:::predict.randomForest. In a classification model, if type="prob" or "votes", a matrix is returned with n columns, one for each class. You will notice that raster::predict has some arguments that can control output. The fun argument lets you pass a custom predict function, superseding any existing predict method for the model object. The index argument lets you define which column of a multi-column data.frame or matrix returned from a given predict method is used. With randomForest probability predictions a column is returned for each class, so you have to define which column you want using index. For a binomial model, to return the prevalence class ["1"] you would use index=2.

raster::predict(model=rf1, object=ApPl_stack, type="prob", index=2)
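To illustrate what index selects, here is a sketch in base R. The probability matrix is made up; it is only shaped like the per-class output of a two-class model, with one row per cell and one column per class:

```r
# Hypothetical class probabilities for three cells of a binary model
prob <- matrix(c(0.9, 0.1,
                 0.3, 0.7,
                 0.5, 0.5),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("0", "1")))

# index = 2 corresponds to keeping the second column, i.e. class "1"
p_class1 <- prob[, 2]
print(p_class1)
```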

I would also note, based on the OP's code, that you want to avoid symbolic (formula) model calls if an index interface is possible. For some reason symbolic calls really slow down predictions such as this, specifically in randomForest. Here is what an index call looks like for randomForest.

rf1 <- randomForest(y=factor(dcc.s.dummy[,"SITE_NONSITE"]), 
                   x=dcc.s.dummy[,-which(names(dcc.s.dummy)=="SITE_NONSITE")])

Or, if you know the positions of your covariates, simply.

rf1 <- randomForest(y=factor(dcc.s.dummy[,"SITE_NONSITE"]),
                    x=dcc.s.dummy[,2:ncol(dcc.s.dummy)])

For this model, I would also highly recommend addressing model fit through parameter selection. Otherwise, you are fitting random variation in your models, and this is reflected in the spatial estimates. Parsimony is actually an important factor in spatial estimates using nonparametric methods. You can address model/parameter selection using rfUtilities::rf.modelSel, as well as addressing multivariate multicollinearity issues and evaluating model fit/performance through a bootstrap approach.
