[GIS] Generating prediction raster from Random Forest model using R

rrandom forestraster

I fit a Random Forest model to tabular data from test sites in R, and now would like to generate a raster showing predicted probability values using raster data corresponding to the same predictors (e.g., slope, elevation, pH) that are in the model.

The RF model is built to predict a 0/1 binary variable SITE_NONSITE using different environmental and geophysical data.

#random forest model
set.seed(321)
rf1 <- randomForest(formula=SITE_NONSITE ~., data=dcc.s.dummy, ntree=500, mtry=10)

dcc.s.dummy includes the following data:

str(dcc.s.dummy)
'data.frame':   7899 obs. of  25 variables:
 $ COST_DIST_ECOTONE        : num  -0.232 0.176 -0.443 -0.478 -0.305 ...
 $ COST_DIST_HEA            : num  -0.233 -0.659 -1.055 -0.999 -0.455 ...
 $ COST_DIST_MEDSTR         : num  0.74388 0.63933 0.55964 0.50768 0.00993 ...
 $ COST_DIST_RIV_COAST      : num  0.59 0.63 0.621 0.639 0.617 ...
 $ DEM30_ASP_RE_2           : num  0 0 0 0 1 0 0 0 0 0 ...
 $ DEM30_ASP_RE_3           : num  0 1 0 0 0 0 0 0 1 0 ...
 $ DEM30_ASP_RE_4           : num  1 0 0 0 0 0 0 1 0 0 ...
 $ DEM30_ASP_RE_5           : num  0 0 1 1 0 1 1 0 0 1 ...
 $ DEM30_M                  : num  0.916 0.72 0.499 0.54 1.114 ...
 $ DEM30_SLOPE              : num  0.2063 0.4631 -0.6445 -0.0512 -0.8235 ...
 $ LOC_REL_RE               : num  -0.489 -0.476 -0.476 -0.459 -0.661 ...
 $ LOC_SD_SLOPE             : num  -0.118 -0.135 -0.316 -0.367 -0.57 ...
 $ SSURGO_ESRI_DRAINAGE_RE_2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_3: num  1 1 1 1 1 1 1 1 1 1 ...
 $ SSURGO_ESRI_DRAINAGE_RE_4: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_5: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_6: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_2 : num  0 0 0 0 0 1 1 0 0 1 ...
 $ SSURGO_ESRI_EROSION_RE_3 : num  1 1 1 0 1 0 0 1 1 0 ...
 $ SSURGO_ESRI_EROSION_RE_4 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_LOC_DIV      : num  -0.328 -0.188 -0.157 -0.213 -0.652 ...
 $ SSURGO_ESRI_NATIVEVEG_2  : num  1 1 1 0 1 0 0 1 1 1 ...
 $ SSURGO_ESRI_NATIVEVEG_3  : num  0 0 0 0 0 1 1 0 0 0 ...
 $ SSURGO_PH                : num  0.813 0.059 1.529 2.32 -1.298 ...
 $ SITE_NONSITE             : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2

I then take rasters corresponding to these same predictors across my entire study area and combine them into a raster stack:

#plot model predictions
COST_DIST_ECOTONE <- raster("cost_dist_ecotone_s.tif.tif")
COST_DIST_HEA <- raster("cost_dist_hea_s.tif.tif")
COST_DIST_MEDSTR <- raster("cost_dist_medstr_s.tif.tif")
COST_DIST_RIV_COAST <- raster("cost_dist_riv_coast_s.tif.tif")
DEM30_ASP_RE_2 <- raster("dem30_asp_rel_2.tif.tif")
DEM30_ASP_RE_3 <- raster("dem30_asp_rel_3.tif.tif")
DEM30_ASP_RE_4 <- raster("dem30_asp_rel_4.tif.tif")
DEM30_ASP_RE_5 <- raster("dem30_asp_rel_5.tif.tif")
DEM30_M <- raster("dem30_m_s.tif.tif")
DEM30_SLOPE <- raster("dem30_slope_s.tif.tif")
LOC_REL_RE <- raster("loc_rel_re_s.tif.tif")
LOC_SD_SLOPE <- raster("loc_sd_slope_s.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_2 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_2.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_3 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_3.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_4 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_4.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_5 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_5.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_6 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_6.tif.tif")
SSURGO_ESRI_EROSION_RE_2 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_2.tif.tif")
SSURGO_ESRI_EROSION_RE_3 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_3.tif.tif")
SSURGO_ESRI_EROSION_RE_4 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_4.tif.tif")
SSURGO_ESRI_LOC_DIV <- raster("SSURGO_ESRI_loc_div_s.tif.tif")
SSURGO_ESRI_NATIVEVEG_2 <- raster("SSURGO_ESRI_nativeveg_nullfill_2.tif.tif")
SSURGO_ESRI_NATIVEVEG_3 <- raster("SSURGO_ESRI_nativeveg_nullfill_3.tif.tif")
SSURGO_PH <- raster("SSURGO_pH_nullfill_s.tif.tif")

ApPl_stack <- stack(COST_DIST_ECOTONE, COST_DIST_HEA, COST_DIST_MEDSTR, COST_DIST_RIV_COAST, DEM30_ASP_RE_2, DEM30_ASP_RE_3, DEM30_ASP_RE_4, DEM30_ASP_RE_5, DEM30_M, DEM30_SLOPE, LOC_REL_RE, LOC_SD_SLOPE, SSURGO_ESRI_DRAINAGE_RE_2, SSURGO_ESRI_DRAINAGE_RE_3, SSURGO_ESRI_DRAINAGE_RE_4, SSURGO_ESRI_DRAINAGE_RE_5, SSURGO_ESRI_DRAINAGE_RE_6, SSURGO_ESRI_EROSION_RE_2, SSURGO_ESRI_EROSION_RE_3, SSURGO_ESRI_EROSION_RE_4, SSURGO_ESRI_LOC_DIV, SSURGO_ESRI_NATIVEVEG_2, SSURGO_ESRI_NATIVEVEG_3, SSURGO_PH)

However, trying to use this raster stack ApPl_stack in raster::predict() fails with the following error:

ApPl_prob <- raster::predict(rf1, newdata=ApPl_stack, type="prob")

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot
coerce class ‘structure("RasterLayer", package = "raster")’ to a
data.frame

Converting to a data frame and using that instead generates this error instead:

ApPl_df <- as.data.frame(ApPl_stack, xy=TRUE)
ApPl_prob <- raster::predict(rf1, newdata=ApPl_df, type="prob")

Error in model.frame.default(Terms, newdata, na.action = na.omit) :
object is not a matrix In addition: Warning message: 'newdata' had
658242 rows but variables found have 754 rows

It can't be a coincidence that there are 658242 cells and 754 rows in each of my predictor rasters. What am I missing here? I feel like one of the functions is expecting a data type it's not getting.

Best Answer

Just a quick note on this "problem". When you read in a raster, be it a single raster on in a stack/brick, the default names are the names of the on-disk files. In using the raster::predict function the names in the model object must match the names in the stack/brick. As such, it is good convention the assign the names that you want to use across your modeling workflow. This also provides an addition advantage in easing data management.

Let's say you have a naming convention in your raster layers that correspond to your covariates. You can define a vector of covariate names and then use the vector to read the data with very efficient code.

dummy covariate/raster names

covariates <- paste0(rep("v", 10), 1:10) 

Create a vector of rasters (tif) in specified directory. If different from your working directory you can use the full.names = TRUE argument in list.files.

rlist <- list.files(getwd(), "tif$")

Then you can use grep to query the vector of rasters to match your covariate names, and since you already have a vector of names you can then assign it to the stack object. The grep function returns an index, thus the brackets, of the query. Using paste with collapse allows you to pass multiple values to grep, based on the covariates vector.

vars <- stack(rlist[grep(paste(covariates, collapse = "|"), rlist)])
  names(vars) <- covariates 

Now, the names issue is solved for the raster::predict function. We should address calling the function itself. It is important to keep in mind that raster::predict is wrapper for other predict functions that each have their own data structures. The example at hand would be the predict method for randomForest:::predict.randomForest. In a classification model, if type="prob" or "votes" a data.frame is returned, with n columns, representing each class. You will notice that raster::predict has some arguments that can control output. The fun argument lets you pass a custom predict function, superseding any existing predict method for the model object. The index argument lets you define the column of a multi-column data.frame or matrix that is returned from a given predict method. With randomForest probability predictions a column is returned for each class so, you have to define with column you want using index. For a binomial model, for returning the prevalence class ["1"] you would use index=2.

raster::predict(model=rf1, object=ApPl_stack, type="prob", index=2)

I would also note, based on the OP's code, that you want to avoid symbolic (formula) model calls if an index interface is possible. For some reason symbolic calls really slow down predictions such as this, specifically in randomForest. Here is what an index call looks like for randomForest.

rf1 <- randomForest(y=factor(dcc.s.dummydcc.s.dummy[,"SITE_NONSITE"]), 
                   x=dcc.s.dummy[,-which(names(dcc.s.dummy)=="SITE_NONSITE")])

Or, if you know the positions of your covariates, simply.

rf1 <- randomForest(y=factor(dcc.s.dummy[,"SITE_NONSITE"]),
                    x=dcc.s.dummy[,2:ncol(dcc.s.dummy)])

For this model, I would also highly recommend addressing model fit through parameter selection. Elsewise, you are fitting random variation in your models and this is reflected in the spatial estimates. Parsimony is actually an important factor in spatial estimates using nonparametric methods. You can address model/parameter selection using the rfUtilites::rf.modelSel as well as addressing multivariate multicollinearity issues and evaluate model fit/performance through a Bootstrap approach.