Solved – Model selection with nonlinear predictor

least squaresrregression

I am fairly new to statistics, so please forgive me for using probably the wrong vocabulary.

I have a dataset of locations with measurements of soil carbon SOC1mTHA. I want to study the predictors that best explain SOC1mTHA, among 15 potential predictors. I have used the glmulti package, and finally concluded that based on AICc of all potential model combinations, the best set of predictors is:

> global.model <-  glm(SOC1mTHA ~ CN + pH + MATcru + PETsplash + MAPcru + 
+                        ECM + PGB2, data= dat2)
> summary(global.model)

Call:
glm(formula = SOC1mTHA ~ CN + pH + MATcru + PETsplash + MAPcru + 
    ECM + PGB2, data = dat2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-246.02   -78.72   -31.71   101.80   289.56  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -165.25860  179.19404  -0.922 0.358695    
CN            35.42207    5.54562   6.387 5.85e-09 ***
pH             6.37200    1.80541   3.529 0.000639 ***
MATcru        18.86837    5.68337   3.320 0.001270 ** 
PETsplash     -0.87748    0.12285  -7.143 1.69e-10 ***
MAPcru         0.11970    0.02665   4.491 1.96e-05 ***
ECM           -1.86698    0.51799  -3.604 0.000496 ***
PGB2           2.76935    0.62110   4.459 2.22e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 14572.72)

    Null deviance: 4772480  on 104  degrees of freedom
Residual deviance: 1413554  on  97  degrees of freedom
AIC: 1314.3

Number of Fisher Scoring iterations: 2

However, when visualising the regressions of the individual predictors with visreg, it appears that one of the regressions is nonlinear. In particular, the relationship of soil pH and SOC1mTHA could be a humped-shaped gaussian curve around pH=7.0 (which makes sense in biological terms). Please, see attached figure.

visreg(global.model)

enter image description here

What would be the next step? Should I define global.model as composed by 6 linear regressions + 1 nonlinear regression? How could I do it? Sorry if my question is silly or to broad. Some readings that would help me figure out the solution are also welcome.

EDITED

Following the suggestion by @adibender I have changed the strategy, running model selection and GAM in one single step with select=TRUEin order to detect potential nonlinear predictors beyond pH.

I have therefore included the full model with all potential predictors in GAM:

> global.model2 <- gam(SOC1mTHA ~ s(CN) + s(pH) + s(MATcru) + s(PETsplash) + s(MAPcru) + 
+                        s(ECM) + s(PGB2) + s(moisture) + s(clay), 
+                      data= dat2, select = TRUE)
> summary(global.model2)

Family: gaussian 
Link function: identity 

Formula:
SOC1mTHA ~ s(CN) + s(pH) + s(MATcru) + s(PETsplash) + s(MAPcru) + 
    s(ECM) + s(PGB2) + s(moisture) + s(clay)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  286.743      3.703   77.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                   edf Ref.df      F  p-value    
s(CN)        5.285e-11      9  0.000 0.230133    
s(pH)        7.145e+00      9  2.841 0.000272 ***
s(MATcru)    7.475e+00      9 27.030  < 2e-16 ***
s(PETsplash) 2.308e+00      9  3.510 3.22e-09 ***
s(MAPcru)    7.441e+00      9 11.023 8.01e-16 ***
s(ECM)       5.550e+00      8 15.786  < 2e-16 ***
s(PGB2)      1.498e+00      8  1.987 2.80e-05 ***
s(moisture)  4.146e-11      9  0.000 0.610718    
s(clay)      1.935e+00      9  2.325 4.34e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Rank: 81/82
R-sq.(adj) =  0.969   Deviance explained = 97.9%
GCV = 2139.9  Scale est. = 1439.8    n = 105
  1. How do I know what are the important predictors kept by gam, similar to what I would obtain with glmulti, stepwise regression or a simple "best model" based on AIC? In other words, how do I know which predictors are unimportant, so I don't even need to gather data on them to predict SOC1mTHA?

  2. I have never used a gam model, so I am not sure how to interpret the results. What would be the form of my final model that can be used to predict SOC1mTHA based on the results of summary(global.model2)above?

Best Answer

Non-linear effects: If you have enough observations, you should be assuming potential non-linearity in all continuous covariates and fit a Generalized Additive Model (GAM) instead. If effects are linear, they will be estimated as such due to penalty.

To fit such models you can use mgcv::gam

library(mgcv)
global.model <- gam(SOC1mTHA ~ CN + s(pH) + MATcru + PETsplash + 
    MAPcru + ECM + PGB2, data= dat2)
plot(global.model)

Model/Variable selection: Note, if you additionally set select=TRUE in the call to gam, this will also perform model selection, as individual terms may not only be penalized towards a linear effect, but also towards 0 (see this reference).

Warning: No matter what type of model selection scheme you choose, standard inference will no longer be valid. E.g. the p-values displayed in your output are worthless, unless you adjust for previous AIC selection. This field of statistics is called Post Selection Inference (POSI), but there are few ready to use solutions.