GAM – How to Select Knots in Generalized Additive Models

generalized-additive-modelrsplines

When selecting an appropriate number of knots for a GAM one might want to take into account the number of data and increments on the x-axis.

What if we have 100 increments on the x-axis with 1000 data points at each increment.

The info here says:

If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers: For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value.

So a basic start should be 9 knots in this example? I am just not sure what range of knots would be suitable for this data set as it is possible to fit very small to very large numbers.

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = 100)

library(ggplot)
ggplot(dat, aes(x = x, y = y)) + 
              geom_point(size= 0.5) +                      
stat_smooth(method = "gam", 
            formula = y ~ s(x, bs = "cs"),k=9, col = "black")

If k=25 provided a useful fit, would it be reasonable for this data?

Best Answer

Update If you are a stats newbie like me, this answer may suffice. if you want a more correct answer, see Nukimov's answer.

A much better option is to fit your model using gam() in the mgcv package, which contains a method called Generalized Cross-validation (GCV). GCV will automatically choose the number of knots for your model so that simplicity is balanced against explanatory power. When using gam() in mgcv, turn GCV on by setting k to equal -1.

Just like this:

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))

library(mgcv)
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)
summary(G1) # check the significance of your smooth term
gam.check(G1) # inspect your residuals to evaluate if the degree of smoothing is good

To plot your smooth line you will have to extract the model fit. This should do the trick:

plot(y~x, data = dat, cex = .1)
G1pred <- predict(G1)
I1 <- order(dat$y)
lines(dat$x, G1pred)

You can also adjust k manually, and see what number of k brings you closest to the k value set automatically by GCV.

Related Solutions

Solved – How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done

Here is a simpler example using the link from Nemo. The question I answer is

How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done?

I answer this as this is the title and as

My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?

is rather unclear for reason I provide in the end. Here is the answer to the above question

# simulate data
library(splines)
set.seed(100)
n <- 1000
x <- seq(-4,4,length.out=n)
df <- expand.grid(d = factor(c(0, 1)), x = x)
df <- cbind(y = sin(x) + rnorm(length(df),0,1), df)
x <- df$x

# we start the other way and find the knots `mgcv` uses to make sure we have
# the same knots...
library(mgcv)
mod_gam <- gam(y ~ s(x, bs="ps", k = 7), data = df)
knots <- mod_gam$smooth[[1]]$knots

# find constrained basis as OP describes
X <- splineDesign(knots = knots, x)
C <- rep(1, nrow(X)) %*% X
qrc <- qr(t(C))
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z
rep(1, nrow(X)) %*% XZ # all ~ zero as they should
#R              [,1]          [,2]          [,3]          [,4]          [,5]          [,6]
#R [1,] 2.239042e-13 -2.112754e-13 -3.225198e-13 -6.993017e-14 -2.011724e-13 -3.674838e-14

# now we get roughtly the same basis
all.equal(model.matrix(mod_gam)[, -1], XZ, check.attributes = FALSE)
#R [1] TRUE

# if you want to use a binary by value
mod_gam <- gam(y ~ s(x, bs="ps", k = 7, by = d), data = df)
all.equal(
  model.matrix(mod_gam)[, -1],
  cbind(XZ * (df$d == 0), XZ * (df$d == 1)), check.attributes = FALSE)
#R [1] TRUE

You can do better in terms of computation speed than explicitly computing

Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z

as described on page 211 of

Wood, Simon N.. Generalized Additive Models: An Introduction with R, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). CRC Press.

There are some issues in the OP's code

# drawing the sequence
n <- 100
x <- seq(-4,4,length.out=n)
z <- seq(-4,4,length.out=n)
d <- as.factor(0:1)
library(data.table) # OP did not load the library
data <- CJ(x=x,z=z,d=d)
set.seed(100)

# setting up the model
data[, y :=
     # OP only simulate n random terms -- there are 20000 rows
     sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)]

# creating the uncentered B-Spline-Basis for x and z
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)] # gets an error
#R Error in spline(x, min(x), max(x), 5, 2, by = d, intercept = FALSE) :
#R   unused arguments (by = d, intercept = FALSE)
str(formals(spline)) # here are the formals for `stats::spline`
#R Dotted pair list of 8
#R $ x     : symbol
#R $ y     : NULL
#R $ n     : language 3 * length(x)
#R $ method: chr "fmm"
#R $ xmin  : language min(x)
#R $ xmax  : language max(x)
#R $ xout  : symbol
#R $ ties  : symbol mean

My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?

then I do not get how you would expect to get the same. You may have used different knots and I do not see how the spline function would yield the correct results here.

Thw dotted line corresponds to my fit, the straight line to the gam-version

If the latter is fitted with lm then it is un-penalized so the results should differ?

Generalized Additive Model – Interpretation of bs(x) and GAM Results

bs stands for basis spline. A complete understanding takes a bit of a digression into linear algebra.

First, a natural cubic spline is a very specific, rather rigid, species of curve. natural cubic splines come equipped with a collection of knots $x_1, x_2, \ldots, x_n$, and the definition is as follows.

To the left of the sequence of knots, a natural cubic spline is a line.
Between knots, a natural cubic spline is a third degree polynomial curve. Hence the cubic in the name.
At the knots, the curve must be continuous. At the knots, the derivative also must be continuous (no corner). At the knots, the second derivative must be continuous.

Here's a picture of a natural cubic spline:

enter image description here

Ok, now here's the first answer: including bs in a formula fits a natural cubic spline to your data. It can either:

Use every one of your data points as a potential knot.
Determine a sequence of knots with some heuristic, percentiles of the distribution of whatnot.

The fist case may seem insane, and inviting trouble, but there are good theoretical reasons to justify it. In the case you do not specify a degrees of freedom directly, it is possible to determine a "best" answer with cross validation strategies. Leave-one-out cross validation has an especially appealing form for splines (the optimal value can be determined in linear time).

How does the fitting happen? Well, it turns out that the collection of natural cubic splines with a specified set of knots is a vector space. That is, you can add two splines together, or scale a single spline, and what you get back is a spline. This vector space is finite dimensional (convincing yourself of this is a good way to test your understanding). Hence, the set of splines has a basis. Here's a picture of a basis of a space of splines with knots as $.1, .2, \ldots, .9$:

enter image description here

Once you have a basis $s_i$, any other specific spline can be written as a linear combination of the splines in the basis:

$$ s = \sum_i^n \alpha_i s_i $$

So fitting a spline to data goes from finding the best approximating curve from the collection of splines, to finding the values of $\alpha$ that, when combined with a fixed basis, result in a sum spline that best approximates your data.

So when you include bs in a model formula the following happens:

Depending on the model you're fitting, and the parameters you passed to bs, R chooses a set of knots, and a basis for the collection of splines with that set of knots.
R takes all the points in your data set, and feeds them into the basis of splines it chose. You can see this with model.matrix:


    $ dd <- data.frame(x = c(0, 1, 2, 3, 4, 5))
$ model.matrix(~ bs(x, 2), data=dd)
       (Intercept) bs(x, 2)1 bs(x, 2)2 bs(x, 2)3
     1           1     0.000     0.000     0.000
     2           1     0.384     0.096     0.008
     3           1     0.432     0.288     0.064
     4           1     0.288     0.432     0.216
     5           1     0.096     0.384     0.512
     6           1     0.000     0.000     1.000

R uses the resulting vectors as predictors in your model.

So to predict with the model, you ned to know what specific set of knots R chose and what specific basis it chose for the splines at those knots. You should be able to look a the documentation of either bs or gam to determine this information in any specific case.

Best Answer

Related Solutions

Solved – How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done

Generalized Additive Model – Interpretation of bs(x) and GAM Results

Related Question