GAM – How to Select Knots in Generalized Additive Models

generalized-additive-model, r, splines

When selecting an appropriate number of knots for a GAM, one might want to take into account the number of data points and the number of increments on the x-axis.

What if we have 100 increments on the x-axis, with 1000 data points at each increment?

The info here says:

If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers: For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value.
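
The quoted rule can be sketched in base R. This is only an illustration of "a knot at every 10th ordered x value", not the package's internal code; the simulated x and the names n_knots, idx and knots are assumptions made for the example.

```r
set.seed(1)
x <- rnorm(101)        # 101 hypothetical covariate values
n_knots <- 11

x_sorted <- sort(x)

# Evenly spaced positions through the ordered values: 1, 11, 21, ..., 101
idx <- round(seq(1, length(x_sorted), length.out = n_knots))
knots <- x_sorted[idx]

print(idx)
print(knots)
```

The first and last knots sit at the minimum and maximum of x, and the rest fall at every 10th ordered value in between.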

So a basic start should be 9 knots in this example? I am just not sure what range of knots would be suitable for this data set, since anything from a very small to a very large number of knots can be fitted.

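set.seed(1)
# 100 increments on the x-axis, 100 points at each (scaled down from 1000 per increment)
dat <- data.frame(y = rnorm(10000), x = rep(1:100, each = 100))

library(ggplot2)
ggplot(dat, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  # k is an argument of s(), so it belongs inside the formula
  stat_smooth(method = "gam",
              formula = y ~ s(x, bs = "cs", k = 9), col = "black")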

If k=25 provided a useful fit, would it be reasonable for this data?

Best Answer

Update: If you are a stats newbie like me, this answer may suffice. If you want a more correct answer, see Nukimov's answer.

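A much better option is to fit your model with gam() in the mgcv package, which by default selects the amount of smoothing by Generalized Cross-Validation (GCV). GCV does not choose the number of knots itself; it chooses the smoothing penalty, and hence the effective degrees of freedom of the smooth, so that simplicity is balanced against explanatory power. In s(), k sets the basis dimension, which acts as an upper limit on how wiggly the smooth can be. Setting k = -1 (the default) tells mgcv to use its default basis dimension.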

Just like this:

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))

library(mgcv)
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)
summary(G1) # check the significance of your smooth term
gam.check(G1) # inspect your residuals to evaluate if the degree of smoothing is good
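
As a quick check of what k = -1 does, the sketch below fits the same model with k omitted and with k = -1, then compares the two fits. It assumes mgcv is installed; the object names G_default and G_minus1 are illustrative only.

```r
library(mgcv)

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))

# Same model twice: once with k omitted, once with k = -1
G_default <- gam(y ~ s(x, bs = "cs"), data = dat)
G_minus1  <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)

# Matching coefficients indicate that k = -1 simply requests the default basis
same_fit <- isTRUE(all.equal(coef(G_default), coef(G_minus1)))
print(same_fit)

# Basis dimension actually used, and the smoothness selection criterion
print(G_minus1$smooth[[1]]$bs.dim)
print(G_minus1$method)
```

If the two fits match, any difference you see after changing k comes from the basis dimension, not from switching GCV on or off.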

To plot your smooth line you will have to extract the model fit. This should do the trick:

plot(y ~ x, data = dat, cex = .1)
G1pred <- predict(G1)
# Order by x so the fitted line is drawn left to right
I1 <- order(dat$x)
lines(dat$x[I1], G1pred[I1])

You can also set k manually and compare the effective degrees of freedom (edf) reported by summary() with k. If the edf sits close to k, the basis may be too small, so try a larger k.
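
A minimal sketch of trying several values of k by hand is given below. The data in this thread are pure noise, so the example assumes a hypothetical sine signal instead (dat2, ks and comparison are names invented for the illustration) to make the effect of k visible.

```r
library(mgcv)

set.seed(1)
# Hypothetical data with a real signal, so the choice of k matters
n <- 2000
dat2 <- data.frame(x = runif(n))
dat2$y <- sin(2 * pi * dat2$x) + rnorm(n, sd = 0.3)

# Fit the same smooth with three different basis dimensions
ks <- c(5, 10, 25)
fits <- lapply(ks, function(k) gam(y ~ s(x, k = k, bs = "cs"), data = dat2))

# Effective degrees of freedom of the smooth and the GCV score for each k
comparison <- data.frame(
  k   = ks,
  edf = unname(sapply(fits, function(m) summary(m)$s.table[1, "edf"])),
  gcv = unname(sapply(fits, function(m) m$gcv.ubre))
)
print(comparison)
```

Once k is large enough, the edf and GCV score should change little as k grows further, which suggests the penalty rather than the basis size is controlling the fit.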