Solved – mgcv::gam overfitting

generalized-additive-modelmgcvoverfitting

I'm trying to fit a cubic smoothing spline model to a longitudinal data using mgcv:gam. I have 30 observations collected from 5 subjects (A, B, C, D, and E) at 6 time points (0, 1, 144, 600, 1440, and 2160 hours). I fit the model as follows:

require(mgcv)
# data is ordered as A-t1, A-t2, ..., A-t6, B-t1, B-t2, ..., E-t6
data <-  c(4.9, 9.8, 6.7, 4.4, 4.5, 4.5, 4.8, 6.8, 4.4, 4.3, 4.4, 4.5, 5.2, 9.6, 5.9,
           4.5, 5.0, 4.7, 5.6, 9.9, 6.0, 4.8, 5.1, 4.5, 4.5, 9.2, 6.2, 4.9, 5.0, 5.7)  
subject <- rep(c("A","B","C","D","E"),each=6)  
subject <- factor(subject)
time <- rep(c(0,1,144,600,1440,2160),5)
# fit gam with random intercept and random slope
fit <- gam(data ~ s(time, k=6, bs="cr") + s(subject, bs="re") + s(time, subject, bs="re"),
           family=gaussian(), method="REML")  
# predict fit at a discretized grid  
t.pred <- seq(0,2160,length.out=100)  
fit.pred <- predict(fit, newdata=data.frame(time=t.pred,subject="C"), type="response",
                    exclude=c("s(subject)","s(time,subject)"), se.fit=F)  

When I plot the predicted fit, it is clear that the model has overfit the data (observed data points are superimposed in black). I tried different parameters (choice of smoother, smoothing parameter estimation method, and gam.control options such as epsilon, maxit, mgcv.tole, etc.) but have not been able to fix this issue. I would appreciate your help!

plot(t.pred, fit.pred, type="l", lwd=2, col="red")  
points(time, data)

Overfit model
I'm also including a trellis plot of raw data, where observations are connected by lines, suggesting the correct trend for the fit.
RawData

Best Answer

The problem is what you mean by 'smooth' here. If you really want a curve that is smooth w.r.t. time and passes through the spike in the data at time 1 then it will have to vary enormously on the y scale. But in reality smoothness on a transformed time scale is probably what is wanted here. For example if we assume smoothness on the 4th root of time scale then the plots look much more like what you probably wanted (I've used uneven spacing for t.pred to make sure the rapidly varying region is well represented)...

fit <- gam(data ~ s(I(time^.25), k=6, bs="cr") + s(subject, bs="re") + 
s(time, subject, bs="re"),family=gaussian(), method="REML")  

t.pred <- seq(0,2160^.25,length.out=2000)^4 
fit.pred <- predict(fit, newdata=data.frame(time=t.pred,subject="C"),
type="response",exclude=c("s(subject)","s(time,subject)"), se.fit=F)  
plot(time,data)
lines(t.pred, fit.pred,lwd=2,col="red")