Update: if you are a stats newbie like me, this answer may suffice; if you want a more correct answer, see Nukimov's answer.
A much better option is to fit your model using gam() from the mgcv package, which by default selects the amount of smoothing by Generalized Cross-Validation (GCV). GCV automatically balances simplicity against explanatory power, so you don't have to pick the number of knots yourself. Setting k = -1 inside s() simply uses the default basis dimension and lets the GCV penalty choose the effective degrees of freedom.
Just like this:
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))
library(mgcv)
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)
summary(G1) # check the significance of your smooth term
gam.check(G1) # inspect your residuals to evaluate if the degree of smoothing is good
To plot your smooth line you will have to extract the model fit, and sort by x so the line is drawn left to right. This should do the trick:
plot(y ~ x, data = dat, cex = .1)
G1pred <- predict(G1)
I1 <- order(dat$x)          # sort by x, otherwise lines() scribbles back and forth
lines(dat$x[I1], G1pred[I1])
You can also set k manually and see which value brings you closest to the effective degrees of freedom chosen automatically under GCV.
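To make that comparison concrete, here is a minimal sketch (reusing the dat and G1 objects defined above): it reads off the effective degrees of freedom the penalty settled on, then refits at a few fixed basis dimensions and prints their GCV scores. The specific k values tried are arbitrary choices for illustration.

```r
library(mgcv)
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)  # default basis dimension

summary(G1)$edf  # effective degrees of freedom chosen under GCV

# Refit with a few fixed basis dimensions and compare GCV scores
for (k in c(4, 6, 10)) {
  Gk <- gam(y ~ s(x, k = k, bs = "cs"), data = dat)
  cat("k =", k, " total edf =", round(sum(Gk$edf), 2),
      " GCV =", Gk$gcv.ubre, "\n")
}
```

The k with the lowest GCV score should sit close to the effective degrees of freedom reported for the automatic fit.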
geom_smooth() in ggplot2 cannot include covariates in its formula, so (if we use the tidyverse) we have to build the models and the fitted values separately, merge the data frames, and then plot the results.
library(tidyverse)
library(mgcv)
set.seed(0)
mydata <- tibble(exposure = 1:20,
                 covariate = rnorm(20, 0, 0.1),
                 outcome = rnorm(20, 5, 2))

fit_lm <- mydata %>%
  lm(outcome ~ exposure + covariate, data = .) %>%
  predict(newdata = mydata, interval = "confidence") %>%
  as_tibble()  # as_data_frame() is deprecated

fit_gam <- mydata %>%
  gam(outcome ~ s(exposure, k = 5) + covariate, data = .) %>%
  predict(newdata = mydata, type = "link", se.fit = TRUE) %>%
  as_tibble() %>%
  rename(fit_gam = fit) %>%
  mutate(lwr_gam = fit_gam - 2 * se.fit,
         upr_gam = fit_gam + 2 * se.fit) %>%
  select(-se.fit)

mydata %>%
  bind_cols(fit_lm, fit_gam) %>%
  ggplot() +
  geom_point(aes(exposure, outcome)) +
  geom_line(aes(exposure, fit), size = 1, color = "red") +
  geom_ribbon(aes(x = exposure, ymin = lwr, ymax = upr), alpha = 0.2, fill = "red") +
  geom_line(aes(exposure, fit_gam), size = 1, color = "blue") +
  geom_ribbon(aes(x = exposure, ymin = lwr_gam, ymax = upr_gam), alpha = 0.2, fill = "blue")
NB: check that the confidence interval for the GAM is appropriate for your case (see "Confidence interval for GAM model").
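As one way of doing that check, here is a sketch that rebuilds the interval on the link scale with the exact 97.5% normal quantile instead of the rough factor of 2, and maps it back through the model's inverse link (an identity link here, but this step matters for, e.g., a Poisson GAM). The data and model mirror the example above.

```r
library(mgcv)
set.seed(0)
d <- data.frame(exposure = 1:20,
                covariate = rnorm(20, 0, 0.1),
                outcome = rnorm(20, 5, 2))
m <- gam(outcome ~ s(exposure, k = 5) + covariate, data = d)

p    <- predict(m, type = "link", se.fit = TRUE)
crit <- qnorm(0.975)            # 1.96, instead of the rough 2
inv  <- family(m)$linkinv       # inverse link; identity for this Gaussian model

ci <- data.frame(fit = inv(p$fit),
                 lwr = inv(p$fit - crit * p$se.fit),
                 upr = inv(p$fit + crit * p$se.fit))
head(ci)
```

For a Gaussian identity-link model the two approaches differ only slightly; for non-identity links, building the interval on the link scale first keeps it inside the valid range of the response.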
Best Answer
In the documentation for the mgcv package, there is a page describing the spline-based smoothers available, and Wood (the package author) offers advice on choosing among them. Since you have fewer than 200 data points, I don't think you will run into computational issues with the default method. In section 4.1 of Wood's book "Generalized Additive Models: An Introduction with R", he summarises the major smoothing bases available in mgcv (thin plate regression splines, Duchon splines, cubic regression splines, P-splines) along with a discussion of their merits and other practical considerations. I have found the book quite helpful in developing my understanding of GAMs.