Use non-linear regression output to predict values across raster including categorical attributes

rrasterregression

We are aiming to correct general snow depth values from publicly available layers at a coarse resolution (~1km) using our vegetation and other parameters collected in the field at point locations using R. We have successfully built a non-linear regression including one categorical attribute (stand type) and several continuous variables such as basal area, elevation and vegetation cover to correct landscape snow depth values to match the measured ones better. The input into the model is a data table (not a raster). The categorical attribute acts in interaction with the other quantitative attributes. We can use the output from the regression to predict values to other locations as a dataframe with the predict() function but I have not found a way to do it for the raster (using raster matrix as input). Is it possible to run predictions on categorical raster layers with interaction terms?

——-details——–

lm_trial_basic <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type*(BA3), data=recorded)
summary(lm_trial_basic)
Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -10   35  -0.1    0.23    
snow_SNODAS_0_cm              0.03    0.14   0.6    0.6    
I((snow_SNODAS_0_cm^2)/100)   0.58404    0.10609   5.505 5.69e-08 ***
stand_typecon                67   43   0.963    0.336    
stand_typehwd                34   44   1.174    0.241    
stand_typeopen               60   45   1.4    0.231    
BA3                           1.5    1.1   0.99    0.321    
stand_typecon:BA3            -1.50    1.30  -1.152    0.25    
stand_typehwd:BA3            -1.41    1.31  -1.075    0.28   
stand_typeopen:BA3           -2.01    1.30  -1.546    0.12    
Multiple R-squared:  0.7267,    Adjusted R-squared:  0.7221
corrsnow<-predict(lm_trial_basic, newdata=landscape)#this works IF landscape is a dataframe but not a rasterStack

head(recorded)
     snow_actual_cm snow_SNODAS_0_cm stand_type     BA3
  1:           40.0         66.35517        con 31.9542
  2:           36.6         66.35517        con 31.9542
  3:           26.6         66.35517        hwd 28.6610
  4:           47.4         66.35517        open 32.3083
  5:           49.4         66.35517        open 36.8677

I can generate corrected snow depth values (as a raster) with the predict() function using the model output from the data frame and raster values (rasterstack) as input. But, ONLY if I use quantitative values as input (raster stack attribute names matches the names from the model output without a hitch).

lm_trial_simple <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + BA3, data=recorded)
MODEL (sample including variables)
(Intercept)   10
snow_SNODAS_0_cm  0.5
BA3              -0.3

As soon as I try to include the categorical values in the raster prediction, it fails. Even though I made every attempt to match the variables names.

lm_trial_simple <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + stand_type+BA3, data=recorded)
MODEL (sample including variables)
(Intercept)   10
snow_SNODAS_0_cm  0.5
stand_typecon    -15
stand_typehwd     6
stand_typeopen   -3
BA3              -0.3
corrSNODAS_lm <- raster::predict(object=veg_var, model=lm_trial_simple) 
Error in `[.data.frame`(blockvals, , f[j]) : undefined columns selected
In addition: Warning message:
  In .local(object, ...) : factor name(s): stand_type  not in layer names
names(input_rast)
[1] "BA3" "snow_SNODAS_0_cm" "stand_typecon"   "stand_typehwd"   "stand_typeopen"

I matched the raster stack attribute names to the names from the model output but it still does not accept the input/output. The values for each stand_type raster layer are binary presence/absence
Furthermore, the actual model that I want to fit uses interaction terms with the categorical variables, which I am even less hopeful can be done with a raster layer. I am considering: A) converting all raster tiles to points and dataframes for the calculations (which will probably be really long in computing time) B) running the prediction model for each stand type separately.

———————————-sample data—————————-

#set up sample data table (input values)

recorded<-data.frame(snow_actual_cm=rnorm(n=99, mean=70, sd=10), 
                     snow_SNODAS_0_cm=rnorm(n=99, mean=70, sd=10), 
                     stand_type=as.factor(c(rep(c("con", "hwd", "open"), 33))), 
                     BA3=rnorm(n=99, mean=30, sd=5))
recorded$snow_actual_cm<-(-10)+recorded$snow_SNODAS_0_cm*0.03+60+1.5*recorded$BA3*1.5+rnorm(n=99, mean=0, sd=4)
recorded$stand_type<-factor(recorded$stand_type, levels=c("mix", "con", "hwd", "open"))
recorded<-as.data.table(recorded)
#run non-linear regression with sample data
lm_trial_basic <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type*(BA3), data=recorded)
summary(lm_trial_basic)
#Coefficients:
#  Estimate Std. Error t value Pr(>|t|)    
#(Intercept)                 40.504693  15.851909   2.555   0.0123 *  
#  snow_SNODAS_0_cm             0.221650   0.432710   0.512   0.6098    
#I((snow_SNODAS_0_cm^2)/100) -0.120353   0.306132  -0.393   0.6952    
#stand_typecon                2.801514   7.601383   0.369   0.7133    
#stand_typehwd                8.078208   8.459568   0.955   0.3422    
#stand_typeopen               1.340825   7.198779   0.186   0.8527    
#BA3                          2.331762   0.192388  12.120   <2e-16 ***
#stand_typecon:BA3           -0.113358   0.246999  -0.459   0.6474    
#stand_typehwd:BA3           -0.279309   0.279532  -0.999   0.3204    
#stand_typeopen:BA3          -0.006518   0.240843  -0.027   0.9785    

#Residual standard error: 4.577 on 89 degrees of freedom
#Multiple R-squared:  0.8944,   Adjusted R-squared:  0.8837 
#F-statistic: 83.74 on 9 and 89 DF,  p-value: < 2.2e-16


#set up sample rasters
x <- raster(ncol=36, nrow=18, xmn=-1000, xmx=1000, ymn=-100, ymx=900)
projection(x) <- "+proj=utm +zone=18 +datum=WGS84"
set.seed(0)
values(x) <- runif(ncell(x))*100

snow_SNODAS_0_cm<-x
values(snow_SNODAS_0_cm)<-values(snow_SNODAS_0_cm)+(1:ncol(snow_SNODAS_0_cm))*18
BA3<-x
values(BA3)<-values(BA3)+(rev(1:ncol(BA3)))*18
standtype<-x
values(standtype)<-rep(sample(1:3), ncell(x)/3)

#combine sample rasters into a rasterStack
standtypes<-layerize(standtype)
names(standtypes)<-c("stand_typehwd", "stand_typecon", "stand_typeopen")
input_rast<-stack(snow_SNODAS_0_cm, BA3, standtypes)
names(input_rast)<-c("snow_SNODAS_0_cm", "BA3", "stand_typehwd", "stand_typecon", "stand_typeopen")
#run predict function on output from sample data using sample rasters
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_simple) 
#Error in `[.data.frame`(blockvals, , f[j]) : undefined columns selected
#In addition: Warning message:
#  In .local(object, ...) : factor name(s): stand_type  not in layer names

#other alternatives using variables differently
lm_trial_quant <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + BA3, data=recorded)
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_quant) #this works
lm_trial_additive <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type+BA3, data=recorded)
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_additive)  #this fails

Best Answer

The raster::predict or terra::predict functions call the native objects predict under the hood. With raster objects, this specific predict function is still being used, but applied to a raster object as a specific class itself, basically reading raster data in and formatting it so it can be passed to the predict function being called based on the model object class. As such, interaction terms should be automatically applied in regard to the linear model predict call because you do not need to do anything special, when calling stats::lm.predict and stats::model.matrix, when using a model with interaction terms.

In looking at your error, I would imagine that the problem is that terra/raster is not parsing the raster values as factors, which is what lm.predict is expecting. This is easily created so, I would coerce the categorical raster(s) in your model to factors using levels from the terra package, and try predicting the model again. You will need to make sure that the raster stack is a SpatRaster class, using rast, before applying levels and predict.

snow_SNODAS_0_cm <- rast(snow_SNODAS_0_cm)
BA3 <- rast(BA3)
standtypes <- rast(standtypes)
  levels(standtype) <- data.frame(id=1:3, stand_type=c('con', 'hwd', 'open'))

input_rast <- c(snow_SNODAS_0_cm, BA3, standtypes)
corrSNODAS_lm <- terra::predict(object=input_rast, model=lm_trial_basic)

Related Solutions

[GIS] How to utilise single ‘randomForest’ model for several rasters in R

Welcome to SE-GIS! It is great that you have read the manual (as most people often forget to do it)! As it was said in the instruction - the names of the predictors must be the same that was used for training. The name of the predictors if you use 'raster' package are generated as 'raster_name'(without extension)+'.'+'band_number', e.g. 'imagery.tif' with 2 bands will generate names 'imagery.1' and 'imagery.2'. So if you want to use model generated for the one raster ('imagery.tif') to predict another ('another_imagery.tif') you need to supply raster with exact the same name that was used in the first place. Just rename the second raster (even temporary just for prediction) - that's all.

Alternatively you may do the same thing without 'raster' package at all. Training data have to be sampled separately and provided as a .csv-file. You will need to convert raster to data frame, rename its bands, make prediction and convert data frame into the raster again. The spatial reference information will be lost during transformation and you will need to recover it one way or another. I tried it once and I find it more tedious then renaming file for 'raster' package.

Here is the code that I used for myself (mostly taken from this article that is in Russian). SRS in this case was recovered externally (outside R). Here I had 5-band raster and training values as .csv-file (sampling of raster values was made in QGIS and results were saved as .csv). You may export to .csv the sampling results that were created using 'raster' package.

library(rgdal)
library(randomForest)

# import training data
train_new <- read.delim("train_new.csv")

#add class 2 (initial set contains only 0 and 1 values for $class)
train_new$class[ ((train_new$build_dist < 500) & (train_new$main_road < 300)) ] <- 2

train_new$class <- as.factor(train_new$class)
band_names <- names(train_new)

# start training
tt <- randomForest(class ~.,  data = train_new, ntree=1000, importance=T)


# import raster data
image <- "image.tif"
x <- new("GDALReadOnlyDataset", image)
width <- dim(x)[2]
height <- dim(x)[1]
imagedata <- data.frame(getRasterTable(x))
bnames <- band_names[1:5]

#rename raster bands to match names of attributes in .csv
names(imagedata) <- c("x", "y", bnames) 
imagedata <- imagedata[3:7] #delete unwanted 'x' and 'y' fields

# prediction
B <- predict(tt, imagedata, type="class")
BM <- as.matrix(B)

# export result to .tiff
tif_driver <- new("GDALDriver", "GTiff")
tif2 <- new("GDALTransientDataset", tif_driver, height, width, 1, 'Byte')
resmtx <- matrix(BM, width, height)
bnd1 <- putRasterData(tif2, resmtx)
tif_file <- "result.tif"
saveDataset(tif2, tif_file)
GDAL.close(tif2)
GDAL.close(tif_driver)

[GIS] using raster.predict for Cubist model, factor levels list

I stick with what I said in my previous comment on this post. I believe that the issue is that you are forcing factors rather than letting the raster predict function handle them. The raster predict wrapper code has explicit factor handling build in.

You should coerce the appropriate variables in your training data to factors and not your rasters. To factorize a raster it has to be pulled into memory, which is not necessary and defeats a memory safe approach using raster. The model object stores class types so, the raster predict function knows if a given variable, corresponding to a raster in the stack, is a factor or numeric. Factorial levels that are present in the new data (raster stack) but not in the training data are assigned NA (NoData).

Following your coding logic it would be more appropriate to use an sp class SpatialPixelsDataFrame or SpatiaGridDataFrame object where, each column represents a covariate and factor coercion is appropriate. In this case, you would just predict a new column to the object.

Best Answer

Related Solutions

[GIS] How to utilise single ‘randomForest’ model for several rasters in R

[GIS] using raster.predict for Cubist model, factor levels list

Related Question