Update: if you are a stats newbie like me, this answer may suffice; if you want a more correct answer, see Nukimov's answer.
A much better option is to fit your model using gam() from the mgcv package, which by default selects the amount of smoothing by Generalized Cross-Validation (GCV). GCV automatically balances simplicity against explanatory power, so you don't have to pick the number of knots yourself. Setting k = -1 inside s() simply uses the default basis dimension and lets the GCV penalty choose the effective degrees of freedom.
Just like this:
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))
library(mgcv)
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)
summary(G1) # check the significance of your smooth term
gam.check(G1) # inspect your residuals to evaluate if the degree of smoothing is good
To plot your smooth line you will have to extract the model fit, and sort by x so the line is drawn left to right. This should do the trick:
plot(y ~ x, data = dat, cex = .1)
G1pred <- predict(G1)
I1 <- order(dat$x)          # sort by x, otherwise lines() scribbles back and forth
lines(dat$x[I1], G1pred[I1])
You can also set k manually and see which value brings you closest to the effective degrees of freedom chosen automatically under GCV.
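To make that comparison concrete, here is a minimal sketch (reusing the dat and G1 objects defined above): it reads off the effective degrees of freedom the penalty settled on, then refits at a few fixed basis dimensions and prints their GCV scores. The specific k values tried are arbitrary choices for illustration.

```r
library(mgcv)
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)  # default basis dimension

summary(G1)$edf  # effective degrees of freedom chosen under GCV

# Refit with a few fixed basis dimensions and compare GCV scores
for (k in c(4, 6, 10)) {
  Gk <- gam(y ~ s(x, k = k, bs = "cs"), data = dat)
  cat("k =", k, " total edf =", round(sum(Gk$edf), 2),
      " GCV =", Gk$gcv.ubre, "\n")
}
```

The k with the lowest GCV score should sit close to the effective degrees of freedom reported for the automatic fit.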
geom_smooth() in ggplot2 cannot include covariates in its formula, so (if we use the tidyverse) we have to build the models and the fitted values separately, merge the data frames, and then plot the results.
library(tidyverse)
library(mgcv)
set.seed(0)
mydata <- tibble(exposure = 1:20,
                 covariate = rnorm(20, 0, 0.1),
                 outcome = rnorm(20, 5, 2))

fit_lm <- mydata %>%
  lm(outcome ~ exposure + covariate, data = .) %>%
  predict(newdata = mydata, interval = "confidence") %>%
  as_tibble()  # as_data_frame() is deprecated

fit_gam <- mydata %>%
  gam(outcome ~ s(exposure, k = 5) + covariate, data = .) %>%
  predict(newdata = mydata, type = "link", se.fit = TRUE) %>%
  as_tibble() %>%
  rename(fit_gam = fit) %>%
  mutate(lwr_gam = fit_gam - 2 * se.fit,
         upr_gam = fit_gam + 2 * se.fit) %>%
  select(-se.fit)

mydata %>%
  bind_cols(fit_lm, fit_gam) %>%
  ggplot() +
  geom_point(aes(exposure, outcome)) +
  geom_line(aes(exposure, fit), size = 1, color = "red") +
  geom_ribbon(aes(x = exposure, ymin = lwr, ymax = upr), alpha = 0.2, fill = "red") +
  geom_line(aes(exposure, fit_gam), size = 1, color = "blue") +
  geom_ribbon(aes(x = exposure, ymin = lwr_gam, ymax = upr_gam), alpha = 0.2, fill = "blue")
NB: check that the confidence interval for the GAM is appropriate for your case (see "Confidence interval for GAM model").
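As one way of doing that check, here is a sketch that rebuilds the interval on the link scale with the exact 97.5% normal quantile instead of the rough factor of 2, and maps it back through the model's inverse link (an identity link here, but this step matters for, e.g., a Poisson GAM). The data and model mirror the example above.

```r
library(mgcv)
set.seed(0)
d <- data.frame(exposure = 1:20,
                covariate = rnorm(20, 0, 0.1),
                outcome = rnorm(20, 5, 2))
m <- gam(outcome ~ s(exposure, k = 5) + covariate, data = d)

p    <- predict(m, type = "link", se.fit = TRUE)
crit <- qnorm(0.975)            # 1.96, instead of the rough 2
inv  <- family(m)$linkinv       # inverse link; identity for this Gaussian model

ci <- data.frame(fit = inv(p$fit),
                 lwr = inv(p$fit - crit * p$se.fit),
                 upr = inv(p$fit + crit * p$se.fit))
head(ci)
```

For a Gaussian identity-link model the two approaches differ only slightly; for non-identity links, building the interval on the link scale first keeps it inside the valid range of the response.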
Best Answer
In the documentation for the mgcv package, there is a page describing the spline-based smoothers available, and Wood (the package author) offers advice on choosing among them. Since you have fewer than 200 data points, I don't think you will run into computational issues with the default method. In section 4.1 of Wood's book "Generalized Additive Models: An Introduction with R", he summarises the major smoothing bases available in mgcv (thin plate regression splines, Duchon splines, cubic regression splines, P-splines) along with a discussion of their merits and other practical considerations. I have found the book quite helpful in developing my understanding of GAMs.