This is a two-part question.
1) I've read the description of the bs
function in the R splines
package, but I don't think I fully understand what the following results in:
x <- seq(0,2*pi,0.1)
y <- sin(x) + rnorm(length(x),sd=0.1)
bs(x)
What is the mathematical function of bs(x)
? Note that I don't specify knots, degrees of freedom etc.
2) Suppose I obtain the following gam
model (R mgcv
package):
yh <- gam(y~bs(x))
I'd like to make predictions of such model outside R. That is, I'd like to perform, say:
predict(yh,newdata=data.frame(x=2))
in a different program/language. What is the mathematical model of the regression function in this case?
Thank you.
Best Answer
bs
stands for basis spline. A complete understanding takes a bit of a digression into linear algebra.First, a natural cubic spline is a very specific, rather rigid, species of curve. natural cubic splines come equipped with a collection of knots $x_1, x_2, \ldots, x_n$, and the definition is as follows.
Here's a picture of a natural cubic spline:
Ok, now here's the first answer: including bs in a formula fits a natural cubic spline to your data. It can either:
The fist case may seem insane, and inviting trouble, but there are good theoretical reasons to justify it. In the case you do not specify a degrees of freedom directly, it is possible to determine a "best" answer with cross validation strategies. Leave-one-out cross validation has an especially appealing form for splines (the optimal value can be determined in linear time).
How does the fitting happen? Well, it turns out that the collection of natural cubic splines with a specified set of knots is a vector space. That is, you can add two splines together, or scale a single spline, and what you get back is a spline. This vector space is finite dimensional (convincing yourself of this is a good way to test your understanding). Hence, the set of splines has a basis. Here's a picture of a basis of a space of splines with knots as $.1, .2, \ldots, .9$:
Once you have a basis $s_i$, any other specific spline can be written as a linear combination of the splines in the basis:
$$ s = \sum_i^n \alpha_i s_i $$
So fitting a spline to data goes from finding the best approximating curve from the collection of splines, to finding the values of $\alpha$ that, when combined with a fixed basis, result in a sum spline that best approximates your data.
So when you include
bs
in a model formula the following happens:bs
, R chooses a set of knots, and a basis for the collection of splines with that set of knots.model.matrix
:So to predict with the model, you ned to know what specific set of knots R chose and what specific basis it chose for the splines at those knots. You should be able to look a the documentation of either
bs
orgam
to determine this information in any specific case.