[GIS] using raster.predict for Cubist model, factor levels list

rraster

I am working on a Cubist prediction model for soil texture, using continuous and categorical covariates. I have a raster stack with all the covariates. The layer names in my raster stack match exactly those in the dataframe used for tuning and calibrate the Cubist model. When I use the function clusterR and predict from the raster package on my raster stack, and my Cubist model, the result is a raster with all cero values. I know many people has had problems with categorical covariates. I think all levels in the raster stack are present in the training dataframe, just for one categorical covariate (geol_50) I assigned NA to those cells with factor levels that were missing in the model training dataframe. I converted into factors some raster layers with as.factor(), and assigned factor levels as explained in the examples from the raster package. The I created a list with the factor levels for all those categorical covariates (a list of lists, each individual list for a factor). I am afraid that since the names of the factor levels for some of them are attached in the second column of the attribute table (they are not integers)(I give example below), the function predict does not know where to find the names of the factor levels in the raster layer.

My second concern is whether the order of the raster layers in the stack has to match exactly the order of the columns in my predictor dataframe in Cubist. I guess it does not need to be the same, since the names are what matter.

I provide an example here, with only two variables:

### Previosuly, I had prepared a list with 37 rasterLayers, called rast_na

### Fix missing values for that categorical covariate (3 levels in the raster are missing in the model training dataframe)

rast_na[[11]] <- calc(rast_na[[11]], function(x) {x[x == 1 | x == 12 | x == 15] <- NA ; return(x)}, filename="geol_50_na.tif", type="GTiff", overwrite=T)

### Create raster stack using rast_na

s <- stack() ### create empty raster stack for(i in 1:length(rast_na)){ s <- addLayer(s,rast_na[[i]] ) }

names(s) <-c( "alt_channel","bois00","catch_area","crl00","crop","cti","curv" , "curv_lg","curv_tr","geol_1m","geol_50","hli","idpr","jach00", "k_th","landf","mat_1m","mid_slope","mrrtf","mrvbf","olea00" , "pasture","potassium","pto_centre","slope_height","slope","soil_1m","srtm_90m", "surf_area","surf_rlf","thorium","tmo_centre","total","twi","typo_centre", "uranium","vineyard")

### Convert some raster layers into factors

for (i in c("geol_1m","geol_50","mat_1m","soil_1m","pto_centre","tmo_centre","typo_centre")){ print(i) ### What layer are we working in? print(is.factor(s[[i]])) ### Is it a factor? s[[i]] <- as.factor(s[[i]]) ### Convert into factor, and keep it in the same stack print(is.factor(s[[i]])) ### Did it change into factor? gc() ### Garbage clean! }

### Assign factor levels (example with 2 covariates)

### Geol_1m
f <- levels(s[["geol_1m"]])[[1]] f$classes <- c("Cretaceous", "Eocene", "Jurassic", "Oligo_miocene", "Pliocene", "Pleistoce", "Holocene", "Triassic_paleo") levels(s[["geol_1m"]]) <- f levels(s[["geol_1m"]]) str(s[["geol_1m"]])

### Geol_50
f <- levels(s[["geol_50"]])[[1]] f$classes <- c("Alluvions","Colluvions", "Sables", "Limons", "Marnes", "Calcaires", "Molasses","Craie", "Argiles", "Sables_argilles", "R_cris_grenues", "Metamorphiques", "Limons_arg_silex") levels(s[["geol_50"]]) <- f levels(s[["geol_50"]])

#### Let's see how one raster layer looks like:

s[["geol_50"]] class : RasterLayer dimensions : 3269, 2655, 8679195 (nrow, ncol, ncell) resolution : 90, 90 (x, y) extent : 473760, 712710, 6581160, 6875370 (xmin, xmax, ymin, ymax) coord. ref. : +proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs data source : D:\romandobarco\prediction_grid\Masked\geol_50_na.tif names : geol_50 values : 2, 16 (min, max) attributes : ID classes from: 2 Alluvions to : 16 Limons_arg_silex

##### Create a list with the factor levels for categorical variables
cat_cov <- c("geol_1m","geol_50","mat_1m","pto_centre","soil_1m","tmo_centre","typo_centre") geol_1m <- levels(s[["geol_1m"]]) geol_50 <- levels(s[["geol_50"]]) mat_1m <- levels(s[["mat_1m"]]) pto_centre <- levels(s[["pto_centre"]]) soil_1m <- levels(s[["soil_1m"]]) tmo_centre <- levels(s[["tmo_centre"]]) typo_centre <- levels(s[["typo_centre"]]) factor_list <- list(geol_1m,geol_50,mat_1m,pto_centre,soil_1m,tmo_centre,typo_centre) names(factor_list) <- cat_cov

`##### Predict using the function cluster R, to use several cores at the same time

My Cubist model is called modelclay_f`

beginCluster() cubistMapclay <- clusterR(s, predict, args=list(modelclay_f,neighbors = 9), factors=factor_list, na.rm=T,inf.rm=T, filename= "clay_alr.tif", format="GTiff",progress="text",overwrite=T) endCluster() plot(cubistMapclay)

And here everything fails.

For example, if I display the levels of geol_50 I get something like this:

geol_50 [[1]] ID classes 1 2 Alluvions 2 3 Colluvions 3 4 Sables 4 5 Limons 5 6 Marnes 6 7 Calcaires 7 8 Molasses 8 9 Craie 9 10 Argiles 10 11 Sables_argilles 11 13 R_cris_grenues 12 14 Metamorphiques 13 16 Limons_arg_silex

Best Answer

I stick with what I said in my previous comment on this post. I believe that the issue is that you are forcing factors rather than letting the raster predict function handle them. The raster predict wrapper code has explicit factor handling build in.

You should coerce the appropriate variables in your training data to factors and not your rasters. To factorize a raster it has to be pulled into memory, which is not necessary and defeats a memory safe approach using raster. The model object stores class types so, the raster predict function knows if a given variable, corresponding to a raster in the stack, is a factor or numeric. Factorial levels that are present in the new data (raster stack) but not in the training data are assigned NA (NoData).

Following your coding logic it would be more appropriate to use an sp class SpatialPixelsDataFrame or SpatiaGridDataFrame object where, each column represents a covariate and factor coercion is appropriate. In this case, you would just predict a new column to the object.

Related Solutions

[GIS] How to utilise single ‘randomForest’ model for several rasters in R

Welcome to SE-GIS! It is great that you have read the manual (as most people often forget to do it)! As it was said in the instruction - the names of the predictors must be the same that was used for training. The name of the predictors if you use 'raster' package are generated as 'raster_name'(without extension)+'.'+'band_number', e.g. 'imagery.tif' with 2 bands will generate names 'imagery.1' and 'imagery.2'. So if you want to use model generated for the one raster ('imagery.tif') to predict another ('another_imagery.tif') you need to supply raster with exact the same name that was used in the first place. Just rename the second raster (even temporary just for prediction) - that's all.

Alternatively you may do the same thing without 'raster' package at all. Training data have to be sampled separately and provided as a .csv-file. You will need to convert raster to data frame, rename its bands, make prediction and convert data frame into the raster again. The spatial reference information will be lost during transformation and you will need to recover it one way or another. I tried it once and I find it more tedious then renaming file for 'raster' package.

Here is the code that I used for myself (mostly taken from this article that is in Russian). SRS in this case was recovered externally (outside R). Here I had 5-band raster and training values as .csv-file (sampling of raster values was made in QGIS and results were saved as .csv). You may export to .csv the sampling results that were created using 'raster' package.

library(rgdal)
library(randomForest)

# import training data
train_new <- read.delim("train_new.csv")

#add class 2 (initial set contains only 0 and 1 values for $class)
train_new$class[ ((train_new$build_dist < 500) & (train_new$main_road < 300)) ] <- 2

train_new$class <- as.factor(train_new$class)
band_names <- names(train_new)

# start training
tt <- randomForest(class ~.,  data = train_new, ntree=1000, importance=T)


# import raster data
image <- "image.tif"
x <- new("GDALReadOnlyDataset", image)
width <- dim(x)[2]
height <- dim(x)[1]
imagedata <- data.frame(getRasterTable(x))
bnames <- band_names[1:5]

#rename raster bands to match names of attributes in .csv
names(imagedata) <- c("x", "y", bnames) 
imagedata <- imagedata[3:7] #delete unwanted 'x' and 'y' fields

# prediction
B <- predict(tt, imagedata, type="class")
BM <- as.matrix(B)

# export result to .tiff
tif_driver <- new("GDALDriver", "GTiff")
tif2 <- new("GDALTransientDataset", tif_driver, height, width, 1, 'Byte')
resmtx <- matrix(BM, width, height)
bnd1 <- putRasterData(tif2, resmtx)
tif_file <- "result.tif"
saveDataset(tif2, tif_file)
GDAL.close(tif2)
GDAL.close(tif_driver)

[GIS] R package spatstat: How to use point process model covariate as factor when pixel image values are numeric

The problem here is a bug in asImRaster, a function from the package geostatsp designed to convert between image formats. After the command anthro2000.africa.img <- asImRaster(f), if you had printed the object anthro2000.africa.img, the printout would have revealed that the pixel values are numeric. The factor levels have been discarded by asImRaster.

You can work around this bug by reinstating the levels:

levels(anthro2000.africa.img) <- levels(f).

[Always print the resulting object to check it is what you expected.]

Then you can proceed to fit the model using ppm as you did before.

[Note: contrary to one of the comments above, this is a perfectly valid point process model. It is explained in detail in chapter 9 of the spatstat book.]

Best Answer

Related Solutions

[GIS] How to utilise single ‘randomForest’ model for several rasters in R

[GIS] R package spatstat: How to use point process model covariate as factor when pixel image values are numeric

Related Question