[GIS] using raster.predict for Cubist model, factor levels list

rraster

I am working on a Cubist prediction model for soil texture, using continuous and categorical covariates. I have a raster stack with all the covariates. The layer names in my raster stack match exactly those in the dataframe used for tuning and calibrate the Cubist model. When I use the function clusterR and predict from the raster package on my raster stack, and my Cubist model, the result is a raster with all cero values. I know many people has had problems with categorical covariates. I think all levels in the raster stack are present in the training dataframe, just for one categorical covariate (geol_50) I assigned NA to those cells with factor levels that were missing in the model training dataframe. I converted into factors some raster layers with as.factor(), and assigned factor levels as explained in the examples from the raster package. The I created a list with the factor levels for all those categorical covariates (a list of lists, each individual list for a factor). I am afraid that since the names of the factor levels for some of them are attached in the second column of the attribute table (they are not integers)(I give example below), the function predict does not know where to find the names of the factor levels in the raster layer.

My second concern is whether the order of the raster layers in the stack has to match exactly the order of the columns in my predictor dataframe in Cubist. I guess it does not need to be the same, since the names are what matter.

I provide an example here, with only two variables:

### Previosuly, I had prepared a list with 37 rasterLayers, called rast_na

### Fix missing values for that categorical covariate (3 levels in the raster are missing in the model training dataframe)

rast_na[[11]] <- calc(rast_na[[11]], function(x) {x[x == 1 | x == 12 | x == 15] <- NA ; return(x)}, filename="geol_50_na.tif", type="GTiff", overwrite=T)

### Create raster stack using rast_na

s <- stack() ### create empty raster stack
for(i in 1:length(rast_na)){
s <- addLayer(s,rast_na[[i]] )
}

names(s) <-c( "alt_channel","bois00","catch_area","crl00","crop","cti","curv" ,
"curv_lg","curv_tr","geol_1m","geol_50","hli","idpr","jach00",
"k_th","landf","mat_1m","mid_slope","mrrtf","mrvbf","olea00" , "pasture","potassium","pto_centre","slope_height","slope","soil_1m","srtm_90m",
"surf_area","surf_rlf","thorium","tmo_centre","total","twi","typo_centre",
"uranium","vineyard")

### Convert some raster layers into factors

for (i in c("geol_1m","geol_50","mat_1m","soil_1m","pto_centre","tmo_centre","typo_centre")){
print(i) ### What layer are we working in?
print(is.factor(s[[i]])) ### Is it a factor?
s[[i]] <- as.factor(s[[i]]) ### Convert into factor, and keep it in the same stack
print(is.factor(s[[i]])) ### Did it change into factor?
gc() ### Garbage clean!
}

### Assign factor levels (example with 2 covariates)

### Geol_1m
f <- levels(s[["geol_1m"]])[[1]]
f$classes <- c("Cretaceous", "Eocene", "Jurassic", "Oligo_miocene", "Pliocene", "Pleistoce", "Holocene", "Triassic_paleo")
levels(s[["geol_1m"]]) <- f
levels(s[["geol_1m"]])
str(s[["geol_1m"]])

### Geol_50
f <- levels(s[["geol_50"]])[[1]]
f$classes <- c("Alluvions","Colluvions", "Sables", "Limons", "Marnes", "Calcaires", "Molasses","Craie",
"Argiles", "Sables_argilles", "R_cris_grenues", "Metamorphiques", "Limons_arg_silex")
levels(s[["geol_50"]]) <- f
levels(s[["geol_50"]])

#### Let's see how one raster layer looks like:

s[["geol_50"]]
class : RasterLayer
dimensions : 3269, 2655, 8679195 (nrow, ncol, ncell)
resolution : 90, 90 (x, y)
extent : 473760, 712710, 6581160, 6875370 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
data source : D:\romandobarco\prediction_grid\Masked\geol_50_na.tif
names : geol_50
values : 2, 16 (min, max)
attributes :
ID classes
from: 2 Alluvions
to : 16 Limons_arg_silex

##### Create a list with the factor levels for categorical variables
cat_cov <- c("geol_1m","geol_50","mat_1m","pto_centre","soil_1m","tmo_centre","typo_centre")
geol_1m <- levels(s[["geol_1m"]])
geol_50 <- levels(s[["geol_50"]])
mat_1m <- levels(s[["mat_1m"]])
pto_centre <- levels(s[["pto_centre"]])
soil_1m <- levels(s[["soil_1m"]])
tmo_centre <- levels(s[["tmo_centre"]])
typo_centre <- levels(s[["typo_centre"]])
factor_list <- list(geol_1m,geol_50,mat_1m,pto_centre,soil_1m,tmo_centre,typo_centre)
names(factor_list) <- cat_cov

`##### Predict using the function cluster R, to use several cores at the same time

My Cubist model is called modelclay_f`

beginCluster()
cubistMapclay <- clusterR(s, predict, args=list(modelclay_f,neighbors = 9),
factors=factor_list, na.rm=T,inf.rm=T,
filename= "clay_alr.tif",
format="GTiff",progress="text",overwrite=T)
endCluster()
plot(cubistMapclay)

And here everything fails.

For example, if I display the levels of geol_50 I get something like this:

geol_50
[[1]]
ID classes
1 2 Alluvions
2 3 Colluvions
3 4 Sables
4 5 Limons
5 6 Marnes
6 7 Calcaires
7 8 Molasses
8 9 Craie
9 10 Argiles
10 11 Sables_argilles
11 13 R_cris_grenues
12 14 Metamorphiques
13 16 Limons_arg_silex

Best Answer

I stick with what I said in my previous comment on this post. I believe that the issue is that you are forcing factors rather than letting the raster predict function handle them. The raster predict wrapper code has explicit factor handling build in.

You should coerce the appropriate variables in your training data to factors and not your rasters. To factorize a raster it has to be pulled into memory, which is not necessary and defeats a memory safe approach using raster. The model object stores class types so, the raster predict function knows if a given variable, corresponding to a raster in the stack, is a factor or numeric. Factorial levels that are present in the new data (raster stack) but not in the training data are assigned NA (NoData).

Following your coding logic it would be more appropriate to use an sp class SpatialPixelsDataFrame or SpatiaGridDataFrame object where, each column represents a covariate and factor coercion is appropriate. In this case, you would just predict a new column to the object.