[GIS] Generating prediction raster from Random Forest model using R

rrandom forestraster

I fit a Random Forest model to tabular data from test sites in R, and now would like to generate a raster showing predicted probability values using raster data corresponding to the same predictors (e.g., slope, elevation, pH) that are in the model.

The RF model is built to predict a 0/1 binary variable SITE_NONSITE using different environmental and geophysical data.

#random forest model
set.seed(321)
rf1 <- randomForest(formula=SITE_NONSITE ~., data=dcc.s.dummy, ntree=500, mtry=10)

dcc.s.dummy includes the following data:

str(dcc.s.dummy)
'data.frame':   7899 obs. of  25 variables:
 $ COST_DIST_ECOTONE        : num  -0.232 0.176 -0.443 -0.478 -0.305 ...
 $ COST_DIST_HEA            : num  -0.233 -0.659 -1.055 -0.999 -0.455 ...
 $ COST_DIST_MEDSTR         : num  0.74388 0.63933 0.55964 0.50768 0.00993 ...
 $ COST_DIST_RIV_COAST      : num  0.59 0.63 0.621 0.639 0.617 ...
 $ DEM30_ASP_RE_2           : num  0 0 0 0 1 0 0 0 0 0 ...
 $ DEM30_ASP_RE_3           : num  0 1 0 0 0 0 0 0 1 0 ...
 $ DEM30_ASP_RE_4           : num  1 0 0 0 0 0 0 1 0 0 ...
 $ DEM30_ASP_RE_5           : num  0 0 1 1 0 1 1 0 0 1 ...
 $ DEM30_M                  : num  0.916 0.72 0.499 0.54 1.114 ...
 $ DEM30_SLOPE              : num  0.2063 0.4631 -0.6445 -0.0512 -0.8235 ...
 $ LOC_REL_RE               : num  -0.489 -0.476 -0.476 -0.459 -0.661 ...
 $ LOC_SD_SLOPE             : num  -0.118 -0.135 -0.316 -0.367 -0.57 ...
 $ SSURGO_ESRI_DRAINAGE_RE_2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_3: num  1 1 1 1 1 1 1 1 1 1 ...
 $ SSURGO_ESRI_DRAINAGE_RE_4: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_5: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_6: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_2 : num  0 0 0 0 0 1 1 0 0 1 ...
 $ SSURGO_ESRI_EROSION_RE_3 : num  1 1 1 0 1 0 0 1 1 0 ...
 $ SSURGO_ESRI_EROSION_RE_4 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_LOC_DIV      : num  -0.328 -0.188 -0.157 -0.213 -0.652 ...
 $ SSURGO_ESRI_NATIVEVEG_2  : num  1 1 1 0 1 0 0 1 1 1 ...
 $ SSURGO_ESRI_NATIVEVEG_3  : num  0 0 0 0 0 1 1 0 0 0 ...
 $ SSURGO_PH                : num  0.813 0.059 1.529 2.32 -1.298 ...
 $ SITE_NONSITE             : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2

I then take rasters corresponding to these same predictors across my entire study area and combine them into a raster stack:

#plot model predictions
COST_DIST_ECOTONE <- raster("cost_dist_ecotone_s.tif.tif")
COST_DIST_HEA <- raster("cost_dist_hea_s.tif.tif")
COST_DIST_MEDSTR <- raster("cost_dist_medstr_s.tif.tif")
COST_DIST_RIV_COAST <- raster("cost_dist_riv_coast_s.tif.tif")
DEM30_ASP_RE_2 <- raster("dem30_asp_rel_2.tif.tif")
DEM30_ASP_RE_3 <- raster("dem30_asp_rel_3.tif.tif")
DEM30_ASP_RE_4 <- raster("dem30_asp_rel_4.tif.tif")
DEM30_ASP_RE_5 <- raster("dem30_asp_rel_5.tif.tif")
DEM30_M <- raster("dem30_m_s.tif.tif")
DEM30_SLOPE <- raster("dem30_slope_s.tif.tif")
LOC_REL_RE <- raster("loc_rel_re_s.tif.tif")
LOC_SD_SLOPE <- raster("loc_sd_slope_s.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_2 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_2.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_3 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_3.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_4 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_4.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_5 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_5.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_6 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_6.tif.tif")
SSURGO_ESRI_EROSION_RE_2 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_2.tif.tif")
SSURGO_ESRI_EROSION_RE_3 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_3.tif.tif")
SSURGO_ESRI_EROSION_RE_4 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_4.tif.tif")
SSURGO_ESRI_LOC_DIV <- raster("SSURGO_ESRI_loc_div_s.tif.tif")
SSURGO_ESRI_NATIVEVEG_2 <- raster("SSURGO_ESRI_nativeveg_nullfill_2.tif.tif")
SSURGO_ESRI_NATIVEVEG_3 <- raster("SSURGO_ESRI_nativeveg_nullfill_3.tif.tif")
SSURGO_PH <- raster("SSURGO_pH_nullfill_s.tif.tif")

ApPl_stack <- stack(COST_DIST_ECOTONE, COST_DIST_HEA, COST_DIST_MEDSTR, COST_DIST_RIV_COAST, DEM30_ASP_RE_2, DEM30_ASP_RE_3, DEM30_ASP_RE_4, DEM30_ASP_RE_5, DEM30_M, DEM30_SLOPE, LOC_REL_RE, LOC_SD_SLOPE, SSURGO_ESRI_DRAINAGE_RE_2, SSURGO_ESRI_DRAINAGE_RE_3, SSURGO_ESRI_DRAINAGE_RE_4, SSURGO_ESRI_DRAINAGE_RE_5, SSURGO_ESRI_DRAINAGE_RE_6, SSURGO_ESRI_EROSION_RE_2, SSURGO_ESRI_EROSION_RE_3, SSURGO_ESRI_EROSION_RE_4, SSURGO_ESRI_LOC_DIV, SSURGO_ESRI_NATIVEVEG_2, SSURGO_ESRI_NATIVEVEG_3, SSURGO_PH)

However, trying to use this raster stack ApPl_stack in raster::predict() fails with the following error:

ApPl_prob <- raster::predict(rf1, newdata=ApPl_stack, type="prob")

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot
coerce class ‘structure("RasterLayer", package = "raster")’ to a
data.frame

Converting to a data frame and using that instead generates this error instead:

ApPl_df <- as.data.frame(ApPl_stack, xy=TRUE)
ApPl_prob <- raster::predict(rf1, newdata=ApPl_df, type="prob")

Error in model.frame.default(Terms, newdata, na.action = na.omit) :
object is not a matrix In addition: Warning message: 'newdata' had
658242 rows but variables found have 754 rows

It can't be a coincidence that there are 658242 cells and 754 rows in each of my predictor rasters. What am I missing here? I feel like one of the functions is expecting a data type it's not getting.

Best Answer

Just a quick note on this "problem". When you read in a raster, be it a single raster on in a stack/brick, the default names are the names of the on-disk files. In using the raster::predict function the names in the model object must match the names in the stack/brick. As such, it is good convention the assign the names that you want to use across your modeling workflow. This also provides an addition advantage in easing data management.

Let's say you have a naming convention in your raster layers that correspond to your covariates. You can define a vector of covariate names and then use the vector to read the data with very efficient code.

dummy covariate/raster names

covariates <- paste0(rep("v", 10), 1:10)

Create a vector of rasters (tif) in specified directory. If different from your working directory you can use the full.names = TRUE argument in list.files.

rlist <- list.files(getwd(), "tif$")

Then you can use grep to query the vector of rasters to match your covariate names, and since you already have a vector of names you can then assign it to the stack object. The grep function returns an index, thus the brackets, of the query. Using paste with collapse allows you to pass multiple values to grep, based on the covariates vector.

vars <- stack(rlist[grep(paste(covariates, collapse = "|"), rlist)])
  names(vars) <- covariates

Now, the names issue is solved for the raster::predict function. We should address calling the function itself. It is important to keep in mind that raster::predict is wrapper for other predict functions that each have their own data structures. The example at hand would be the predict method for randomForest:::predict.randomForest. In a classification model, if type="prob" or "votes" a data.frame is returned, with n columns, representing each class. You will notice that raster::predict has some arguments that can control output. The fun argument lets you pass a custom predict function, superseding any existing predict method for the model object. The index argument lets you define the column of a multi-column data.frame or matrix that is returned from a given predict method. With randomForest probability predictions a column is returned for each class so, you have to define with column you want using index. For a binomial model, for returning the prevalence class ["1"] you would use index=2.

raster::predict(model=rf1, object=ApPl_stack, type="prob", index=2)

I would also note, based on the OP's code, that you want to avoid symbolic (formula) model calls if an index interface is possible. For some reason symbolic calls really slow down predictions such as this, specifically in randomForest. Here is what an index call looks like for randomForest.

rf1 <- randomForest(y=factor(dcc.s.dummydcc.s.dummy[,"SITE_NONSITE"]), 
                   x=dcc.s.dummy[,-which(names(dcc.s.dummy)=="SITE_NONSITE")])

Or, if you know the positions of your covariates, simply.

rf1 <- randomForest(y=factor(dcc.s.dummy[,"SITE_NONSITE"]),
                    x=dcc.s.dummy[,2:ncol(dcc.s.dummy)])

For this model, I would also highly recommend addressing model fit through parameter selection. Elsewise, you are fitting random variation in your models and this is reflected in the spatial estimates. Parsimony is actually an important factor in spatial estimates using nonparametric methods. You can address model/parameter selection using the rfUtilites::rf.modelSel as well as addressing multivariate multicollinearity issues and evaluate model fit/performance through a Bootstrap approach.

Related Solutions

[GIS] Rf-Classification seems to work but gives an error 15 seconds later

You need to make sure that names(sdata@data[,3:ncol(sdata@data)]) and names(xvars) are exactly the same. Check this using

identical(names(sdata@data[,3:ncol(sdata@data)]), names(xvars))

If TRUE, your predict should run fine.

The edit related warnings/errors are irrelevant, they relate to you trying to display a SpatialPolygonsDataFrame (and S4 class object) as a standard data.frame in RStudio.

EDIT: It seems you have differences between your stack layer names and your sdata@data data frame. Make sure these are the same. If you would like to include x and y coordinates as layers to your stack (if this makes sense obviously depends on your objective) you could do it like this:

x <- coordinates(xvars)[, 1]
y <- coordinates(xvars)[, 2]

x_rst <- y_rst <- xvars[[1]]
x_rst[] <- x
y_rst[] <- y

Then you would need to add those to your stack at the appropriate position:

xvars <- stack(x_rst, y_rst, xvars)

Note also, that you have additional variables in your sdata@data data frame ("band1.1" etc). I don't know where these come from, maybe you are merging something earlier? Again, for predict() to work properly, the layers of your stack and the columns from your training data need to be identical (the names of these).

[GIS] The stability of randomForest models after increasing predictor variables

One of the cool thing about random forest is that they probe at each node a random subset of the variables. The ones providing the split with best entropy (or other criteria) will be kept, while others will be discarded and possibly tested in a subsequent / different node. In very simple words, if a variable does not provide any information about the split (e.g. canopy vs non-canopy) will not be used in the final model. From the other point of view, a variable can become informative after a split on another variable, and, possibly, this variable will be useful at prediction time.

In principle, add all the information you can add (better if you have the prior knowledge that are somehow dependent with the outputs of your problem). Surely Landsat bands and images correlate well with the presence of canopy, so add them.

Personally, I use RF classifiers with thousands of variables. In these situations you only have to make sure that you employ many trees (prevent overfitting, in principle, the most you use the better it is) and that at each note you test indicatively sqrt(#variables) in order to explore properly your feature space. Regarding the fraction of training examples to use in bagging, I never observed significant differences in test accuracy. Training time can be better though.

Regarding your second point yes, all the variables that you use at training time have to be present at test (prediction) time. In the same order, with the same scaling. (Note that the scaling of the variable itself is not important for random forest, but the distribution of training and test data must obviously be the same!)

Regarding literature, the best thing I ever read about RF is this ("THE") RF tutorial. It is mostly biased towards computer vision applications (in particular towards the kinect body part recognition thing they developed) but is a very easy and nice read. After that, you should understand the whole RF thing. For remote sensing application IDK, maybe just browse academic journals. Surely a proper search with right keywords will give nice references.

EDIT: probably you will be interest only in the classification part of the above tutorial, but I suggest to read it entirely, it's very nice.

Best Answer

Related Solutions

[GIS] Rf-Classification seems to work but gives an error 15 seconds later

[GIS] The stability of randomForest models after increasing predictor variables

Related Question