Use non-linear regression output to predict values across raster including categorical attributes

rrasterregression

We are aiming to correct general snow depth values from publicly available layers at a coarse resolution (~1km) using our vegetation and other parameters collected in the field at point locations using R. We have successfully built a non-linear regression including one categorical attribute (stand type) and several continuous variables such as basal area, elevation and vegetation cover to correct landscape snow depth values to match the measured ones better. The input into the model is a data table (not a raster). The categorical attribute acts in interaction with the other quantitative attributes. We can use the output from the regression to predict values to other locations as a dataframe with the predict() function but I have not found a way to do it for the raster (using raster matrix as input). Is it possible to run predictions on categorical raster layers with interaction terms?


——-details——–

lm_trial_basic <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type*(BA3), data=recorded)
summary(lm_trial_basic)
Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -10   35  -0.1    0.23    
snow_SNODAS_0_cm              0.03    0.14   0.6    0.6    
I((snow_SNODAS_0_cm^2)/100)   0.58404    0.10609   5.505 5.69e-08 ***
stand_typecon                67   43   0.963    0.336    
stand_typehwd                34   44   1.174    0.241    
stand_typeopen               60   45   1.4    0.231    
BA3                           1.5    1.1   0.99    0.321    
stand_typecon:BA3            -1.50    1.30  -1.152    0.25    
stand_typehwd:BA3            -1.41    1.31  -1.075    0.28   
stand_typeopen:BA3           -2.01    1.30  -1.546    0.12    
Multiple R-squared:  0.7267,    Adjusted R-squared:  0.7221
corrsnow<-predict(lm_trial_basic, newdata=landscape)#this works IF landscape is a dataframe but not a rasterStack

head(recorded)
     snow_actual_cm snow_SNODAS_0_cm stand_type     BA3
  1:           40.0         66.35517        con 31.9542
  2:           36.6         66.35517        con 31.9542
  3:           26.6         66.35517        hwd 28.6610
  4:           47.4         66.35517        open 32.3083
  5:           49.4         66.35517        open 36.8677

I can generate corrected snow depth values (as a raster) with the predict() function using the model output from the data frame and raster values (rasterstack) as input. But, ONLY if I use quantitative values as input (raster stack attribute names matches the names from the model output without a hitch).

lm_trial_simple <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + BA3, data=recorded)
MODEL (sample including variables)
(Intercept)   10
snow_SNODAS_0_cm  0.5
BA3              -0.3

As soon as I try to include the categorical values in the raster prediction, it fails. Even though I made every attempt to match the variables names.

lm_trial_simple <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + stand_type+BA3, data=recorded)
MODEL (sample including variables)
(Intercept)   10
snow_SNODAS_0_cm  0.5
stand_typecon    -15
stand_typehwd     6
stand_typeopen   -3
BA3              -0.3
corrSNODAS_lm <- raster::predict(object=veg_var, model=lm_trial_simple) 
Error in `[.data.frame`(blockvals, , f[j]) : undefined columns selected
In addition: Warning message:
  In .local(object, ...) : factor name(s): stand_type  not in layer names
names(input_rast)
[1] "BA3" "snow_SNODAS_0_cm" "stand_typecon"   "stand_typehwd"   "stand_typeopen"

I matched the raster stack attribute names to the names from the model output but it still does not accept the input/output. The values for each stand_type raster layer are binary presence/absence
Furthermore, the actual model that I want to fit uses interaction terms with the categorical variables, which I am even less hopeful can be done with a raster layer. I am considering: A) converting all raster tiles to points and dataframes for the calculations (which will probably be really long in computing time) B) running the prediction model for each stand type separately.


———————————-sample data—————————-

#set up sample data table (input values)

recorded<-data.frame(snow_actual_cm=rnorm(n=99, mean=70, sd=10), 
                     snow_SNODAS_0_cm=rnorm(n=99, mean=70, sd=10), 
                     stand_type=as.factor(c(rep(c("con", "hwd", "open"), 33))), 
                     BA3=rnorm(n=99, mean=30, sd=5))
recorded$snow_actual_cm<-(-10)+recorded$snow_SNODAS_0_cm*0.03+60+1.5*recorded$BA3*1.5+rnorm(n=99, mean=0, sd=4)
recorded$stand_type<-factor(recorded$stand_type, levels=c("mix", "con", "hwd", "open"))
recorded<-as.data.table(recorded)
#run non-linear regression with sample data
lm_trial_basic <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type*(BA3), data=recorded)
summary(lm_trial_basic)
#Coefficients:
#  Estimate Std. Error t value Pr(>|t|)    
#(Intercept)                 40.504693  15.851909   2.555   0.0123 *  
#  snow_SNODAS_0_cm             0.221650   0.432710   0.512   0.6098    
#I((snow_SNODAS_0_cm^2)/100) -0.120353   0.306132  -0.393   0.6952    
#stand_typecon                2.801514   7.601383   0.369   0.7133    
#stand_typehwd                8.078208   8.459568   0.955   0.3422    
#stand_typeopen               1.340825   7.198779   0.186   0.8527    
#BA3                          2.331762   0.192388  12.120   <2e-16 ***
#stand_typecon:BA3           -0.113358   0.246999  -0.459   0.6474    
#stand_typehwd:BA3           -0.279309   0.279532  -0.999   0.3204    
#stand_typeopen:BA3          -0.006518   0.240843  -0.027   0.9785    

#Residual standard error: 4.577 on 89 degrees of freedom
#Multiple R-squared:  0.8944,   Adjusted R-squared:  0.8837 
#F-statistic: 83.74 on 9 and 89 DF,  p-value: < 2.2e-16


#set up sample rasters
x <- raster(ncol=36, nrow=18, xmn=-1000, xmx=1000, ymn=-100, ymx=900)
projection(x) <- "+proj=utm +zone=18 +datum=WGS84"
set.seed(0)
values(x) <- runif(ncell(x))*100

snow_SNODAS_0_cm<-x
values(snow_SNODAS_0_cm)<-values(snow_SNODAS_0_cm)+(1:ncol(snow_SNODAS_0_cm))*18
BA3<-x
values(BA3)<-values(BA3)+(rev(1:ncol(BA3)))*18
standtype<-x
values(standtype)<-rep(sample(1:3), ncell(x)/3)

#combine sample rasters into a rasterStack
standtypes<-layerize(standtype)
names(standtypes)<-c("stand_typehwd", "stand_typecon", "stand_typeopen")
input_rast<-stack(snow_SNODAS_0_cm, BA3, standtypes)
names(input_rast)<-c("snow_SNODAS_0_cm", "BA3", "stand_typehwd", "stand_typecon", "stand_typeopen")
#run predict function on output from sample data using sample rasters
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_simple) 
#Error in `[.data.frame`(blockvals, , f[j]) : undefined columns selected
#In addition: Warning message:
#  In .local(object, ...) : factor name(s): stand_type  not in layer names

#other alternatives using variables differently
lm_trial_quant <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + BA3, data=recorded)
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_quant) #this works
lm_trial_additive <- lm(snow_actual_cm ~ snow_SNODAS_0_cm + I((snow_SNODAS_0_cm^2)/100) + stand_type+BA3, data=recorded)
corrSNODAS_lm <- raster::predict(object=input_rast, model=lm_trial_additive)  #this fails

Best Answer

The raster::predict or terra::predict functions call the native objects predict under the hood. With raster objects, this specific predict function is still being used, but applied to a raster object as a specific class itself, basically reading raster data in and formatting it so it can be passed to the predict function being called based on the model object class. As such, interaction terms should be automatically applied in regard to the linear model predict call because you do not need to do anything special, when calling stats::lm.predict and stats::model.matrix, when using a model with interaction terms.

In looking at your error, I would imagine that the problem is that terra/raster is not parsing the raster values as factors, which is what lm.predict is expecting. This is easily created so, I would coerce the categorical raster(s) in your model to factors using levels from the terra package, and try predicting the model again. You will need to make sure that the raster stack is a SpatRaster class, using rast, before applying levels and predict.

snow_SNODAS_0_cm <- rast(snow_SNODAS_0_cm)
BA3 <- rast(BA3)
standtypes <- rast(standtypes)
  levels(standtype) <- data.frame(id=1:3, stand_type=c('con', 'hwd', 'open'))

input_rast <- c(snow_SNODAS_0_cm, BA3, standtypes)
corrSNODAS_lm <- terra::predict(object=input_rast, model=lm_trial_basic)