Solved – GAM cross-validation to test prediction error

Tags: cross-validation, generalized-additive-model, mgcv, r

My question deals with GAMs in the mgcv R package. Due to a small sample size I want to determine the prediction error using leave-one-out cross-validation (LOOCV). Is this reasonable? Is there a package, or some code, for doing this? The errorest() function in the ipred package does not work here. A simple test dataset is:

library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
summary(b)
pred <- predict(b, type="response")

Thank you very much for your helping hand!

Best Answer

I really like the caret package for things like this, but unfortunately I just read that you can't specify the gam formula exactly in it:

"When you use train with this model, you cannot (at this time) specify the gam formula. caret has an internal function that figures out a formula based on how many unique levels each predictor has etc. In other words, train currently determines which terms are smoothed and which are plain old linear main effects."

source: https://stackoverflow.com/questions/20044014/error-with-train-from-caret-package-using-method-gam

But if you let train select the smooth terms, in this case it reproduces your model exactly anyway. The default performance metric here is RMSE, but you can change it via the summaryFunction argument of trainControl.
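For instance, here is a minimal sketch of swapping in mean absolute error via summaryFunction (maeSummary is a made-up name, not part of caret; the data frame caret passes in holds the held-out observations and predictions):

```r
library(caret)

# Hypothetical custom summary: caret calls this with a data frame that
# has an `obs` column (held-out outcomes) and a `pred` column (predictions),
# and expects a named numeric vector back.
maeSummary <- function(data, lev = NULL, model = NULL) {
  c(MAE = mean(abs(data$obs - data$pred)))
}

ctrl <- trainControl(method = "LOOCV", summaryFunction = maeSummary)
# Then pass trControl = ctrl to train(), together with
# metric = "MAE" and maximize = FALSE.
```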

I think the main drawback of LOOCV is that it takes forever on large datasets. Since your dataset is small and it runs quite fast, I think it is a sensible option.

Hope this helps.

library(mgcv)
library(caret)

set.seed(0)

dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

b <- train(y ~ x0 + x1 + x2 + x3, 
        data = dat,
        method = "gam",
        trControl = trainControl(method = "LOOCV"),  # number/repeats do not apply to LOOCV
        tuneGrid = data.frame(method = "GCV.Cp", select = FALSE)
)

print(b)
summary(b$finalModel)

output:

> print(b)
Generalized Additive Model using Splines 

400 samples
  9 predictors

No pre-processing
Resampling: Leave-One-Out Cross-Validation 

Summary of sample sizes: 399, 399, 399, 399, 399, 399, ... 

Resampling results

  RMSE      Rsquared 
  2.157964  0.7091647

Tuning parameter 'select' was held constant at a value of FALSE

Tuning parameter 'method' was held constant at a value of GCV.Cp

> summary(b$finalModel)

Family: gaussian 
Link function: identity 

Formula:
.outcome ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df       F  p-value    
s(x0) 5.173  6.287   4.564 0.000139 ***
s(x1) 2.357  2.927 103.089  < 2e-16 ***
s(x2) 8.517  8.931  84.308  < 2e-16 ***
s(x3) 1.000  1.000   0.441 0.506929    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.726   Deviance explained = 73.7%
GCV =  4.611  Scale est. = 4.4029    n = 400
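As an addendum: if you'd rather avoid the caret dependency, LOOCV is short enough to hand-roll with mgcv alone. A minimal sketch on the same simulated data (this refits the GAM n = 400 times, so it takes a moment; the resulting RMSE should come out close to caret's figure above):

```r
library(mgcv)

set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

# Leave-one-out CV by hand: refit the model n times, each time
# predicting the single held-out row.
n <- nrow(dat)
errs <- numeric(n)
for (i in seq_len(n)) {
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat[-i, ])
  errs[i] <- dat$y[i] - predict(fit, newdata = dat[i, ])
}

rmse <- sqrt(mean(errs^2))
rmse
```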