Here is a simpler example using the link from Nemo. The question I answer is
How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done?
I answer this as this is the title and as
My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?
is rather unclear for reason I provide in the end. Here is the answer to the above question
# simulate data
library(splines)
set.seed(100)
n <- 1000
x <- seq(-4,4,length.out=n)
df <- expand.grid(d = factor(c(0, 1)), x = x)
df <- cbind(y = sin(x) + rnorm(length(df),0,1), df)
x <- df$x
# we start the other way and find the knots `mgcv` uses to make sure we have
# the same knots...
library(mgcv)
mod_gam <- gam(y ~ s(x, bs="ps", k = 7), data = df)
knots <- mod_gam$smooth[[1]]$knots
# find constrained basis as OP describes
X <- splineDesign(knots = knots, x)
C <- rep(1, nrow(X)) %*% X
qrc <- qr(t(C))
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z
rep(1, nrow(X)) %*% XZ # all ~ zero as they should
#R [,1] [,2] [,3] [,4] [,5] [,6]
#R [1,] 2.239042e-13 -2.112754e-13 -3.225198e-13 -6.993017e-14 -2.011724e-13 -3.674838e-14
# now we get roughtly the same basis
all.equal(model.matrix(mod_gam)[, -1], XZ, check.attributes = FALSE)
#R [1] TRUE
# if you want to use a binary by value
mod_gam <- gam(y ~ s(x, bs="ps", k = 7, by = d), data = df)
all.equal(
model.matrix(mod_gam)[, -1],
cbind(XZ * (df$d == 0), XZ * (df$d == 1)), check.attributes = FALSE)
#R [1] TRUE
You can do better in terms of computation speed than explicitly computing
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z
as described on page 211 of
Wood, Simon N.. Generalized Additive Models: An Introduction with R, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). CRC Press.
There are some issues in the OP's code
# drawing the sequence
n <- 100
x <- seq(-4,4,length.out=n)
z <- seq(-4,4,length.out=n)
d <- as.factor(0:1)
library(data.table) # OP did not load the library
data <- CJ(x=x,z=z,d=d)
set.seed(100)
# setting up the model
data[, y :=
# OP only simulate n random terms -- there are 20000 rows
sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)]
# creating the uncentered B-Spline-Basis for x and z
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)] # gets an error
#R Error in spline(x, min(x), max(x), 5, 2, by = d, intercept = FALSE) :
#R unused arguments (by = d, intercept = FALSE)
str(formals(spline)) # here are the formals for `stats::spline`
#R Dotted pair list of 8
#R $ x : symbol
#R $ y : NULL
#R $ n : language 3 * length(x)
#R $ method: chr "fmm"
#R $ xmin : language min(x)
#R $ xmax : language max(x)
#R $ xout : symbol
#R $ ties : symbol mean
To
My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?
then I do not get how you would expect to get the same. You may have used different knots and I do not see how the spline
function would yield the correct results here.
Thw dotted line corresponds to my fit, the straight line to the gam-version
If the latter is fitted with lm
then it is un-penalized so the results should differ?
bs
stands for basis spline. A complete understanding takes a bit of a digression into linear algebra.
First, a natural cubic spline is a very specific, rather rigid, species of curve. natural cubic splines come equipped with a collection of knots $x_1, x_2, \ldots, x_n$, and the definition is as follows.
- To the left of the sequence of knots, a natural cubic spline is a line.
- Between knots, a natural cubic spline is a third degree polynomial curve. Hence the cubic in the name.
- At the knots, the curve must be continuous. At the knots, the derivative also must be continuous (no corner). At the knots, the second derivative must be continuous.
Here's a picture of a natural cubic spline:
Ok, now here's the first answer: including bs in a formula fits a natural cubic spline to your data. It can either:
- Use every one of your data points as a potential knot.
- Determine a sequence of knots with some heuristic, percentiles of the distribution of whatnot.
The fist case may seem insane, and inviting trouble, but there are good theoretical reasons to justify it. In the case you do not specify a degrees of freedom directly, it is possible to determine a "best" answer with cross validation strategies. Leave-one-out cross validation has an especially appealing form for splines (the optimal value can be determined in linear time).
How does the fitting happen? Well, it turns out that the collection of natural cubic splines with a specified set of knots is a vector space. That is, you can add two splines together, or scale a single spline, and what you get back is a spline. This vector space is finite dimensional (convincing yourself of this is a good way to test your understanding). Hence, the set of splines has a basis. Here's a picture of a basis of a space of splines with knots as $.1, .2, \ldots, .9$:
Once you have a basis $s_i$, any other specific spline can be written as a linear combination of the splines in the basis:
$$ s = \sum_i^n \alpha_i s_i $$
So fitting a spline to data goes from finding the best approximating curve from the collection of splines, to finding the values of $\alpha$ that, when combined with a fixed basis, result in a sum spline that best approximates your data.
So when you include bs
in a model formula the following happens:
- Depending on the model you're fitting, and the parameters you passed to
bs
, R chooses a set of knots, and a basis for the collection of splines with that set of knots.
- R takes all the points in your data set, and feeds them into the basis of splines it chose. You can see this with
model.matrix
:
$ dd <- data.frame(x = c(0, 1, 2, 3, 4, 5))
$ model.matrix(~ bs(x, 2), data=dd)
(Intercept) bs(x, 2)1 bs(x, 2)2 bs(x, 2)3
1 1 0.000 0.000 0.000
2 1 0.384 0.096 0.008
3 1 0.432 0.288 0.064
4 1 0.288 0.432 0.216
5 1 0.096 0.384 0.512
6 1 0.000 0.000 1.000
- R uses the resulting vectors as predictors in your model.
So to predict with the model, you ned to know what specific set of knots R chose and what specific basis it chose for the splines at those knots. You should be able to look a the documentation of either bs
or gam
to determine this information in any specific case.
Best Answer
Update If you are a stats newbie like me, this answer may suffice. if you want a more correct answer, see Nukimov's answer.
A much better option is to fit your model using gam() in the mgcv package, which contains a method called Generalized Cross-validation (GCV). GCV will automatically choose the number of knots for your model so that simplicity is balanced against explanatory power. When using gam() in mgcv, turn GCV on by setting k to equal -1.
Just like this:
To plot your smooth line you will have to extract the model fit. This should do the trick:
You can also adjust k manually, and see what number of k brings you closest to the k value set automatically by GCV.