I am running a RBS with 5 knots and I have 2 difficulties.
-
I get only 5 estimates: 1 for the intercept, one for the linear term, and 3 for the knots. However, since I have 5 knots I should have 4 estimates and I don't understand why I get only 3.
-
This is the first time I run the analysis and I am a bit struggling to interpret the results.
This is my script
# Load required libraries
library(rms)
library(ggplot2)
Import <- Data
# Define the knots
knots <- c(0.05, 0.275, 0.5, 0.725, 0.95)
# Perform Restricted Cubic Splines regression
rcs_model <- lm(N_of_cases ~ rcs(Match_time, knots = knots),
data = Import)
# Generate predicted values
Import$predicted <- predict(rcs_model)
# With CI
# Generate predicted values and confidence intervals
pred <- predict(rcs_model, newdata =
data.frame(Match_time = seq(0, 90, by = 1)),
type = "response", se.fit = TRUE)
# Extract lower and upper confidence intervals
lower_ci <- pred$fit - 1.96 * pred$se.fit
upper_ci <- pred$fit + 1.96 * pred$se.fit
# Create a data frame with Match_time, lower_ci, and upper_ci
pred_df <- data.frame(Match_time = seq(0, 90, by = 1), lower_ci, upper_ci)
# Create a plot to visualize the data, regression line, and confidence interval
ggplot() +
geom_point(data = Import, aes(x = Match_time, y = N_of_cases)) +
geom_line(data = Import, aes(x = Match_time, y = predicted),
color = "blue") +
geom_ribbon(data = pred_df, aes(x = Match_time, ymin = lower_ci,
ymax = upper_ci), alpha = 0.2, fill = "grey") +
labs(x = "Match Time", y = "Number of Cases") +
theme_minimal() +
scale_x_continuous(breaks = c(0, 15, 30, 45, 60, 75, 90))
print(summary(rcs_model))
And these are my results
Call:
lm(formula = N_of_cases ~ rcs(Match_time, knots = knots), data = Import)
Residuals:
Min 1Q Median 3Q Max
-6.3414 -1.7992 0.0484 1.5023 12.9694
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.63759 1.26748 4.448 2.61e-05 ***
rcs(Match_time, knots = knots)Match_time -0.02615 0.08859 -0.295 0.7686
rcs(Match_time, knots = knots)Match_time' 0.99727 0.50780 1.964 0.0528 .
rcs(Match_time, knots = knots)Match_time'' -3.68207 1.57176 -2.343 0.0215 *
rcs(Match_time, knots = knots)Match_time''' 5.02997 2.10133 2.394 0.0189 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.016 on 85 degrees of freedom
Multiple R-squared: 0.2167, Adjusted R-squared: 0.1799
F-statistic: 5.88 on 4 and 85 DF, p-value: 0.0003162
These are the results for print(rcs_model)
Call:
lm(formula = N_of_cases ~ rcs(Match_time, knots = knots), data = Import)
Coefficients:
(Intercept) rcs(Match_time, knots = knots)Match_time
5.63759 -0.02615
rcs(Match_time, knots = knots)Match_time' rcs(Match_time, knots = knots)Match_time''
0.99727 -3.68207
rcs(Match_time, knots = knots)Match_time'''
5.02997
These are the results for anova(rcs_model)
> anova(rcs_model)
Analysis of Variance Table
Response: N_of_cases
Df Sum Sq Mean Sq F value Pr(>F)
rcs(Match_time, knots = knots) 4 213.96 53.490 5.8804 0.0003162 ***
Residuals 85 773.19 9.096
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These the results when I use ols
> anova(rcs_model)
Analysis of Variance Response: N_of_cases
Factor d.f. Partial SS MS F P
Match_time 4 213.9618 53.490443 5.88 3e-04
Nonlinear 3 185.1462 61.715404 6.78 4e-04
REGRESSION 4 213.9618 53.490443 5.88 3e-04
ERROR 85 773.1938 9.096397
If I am not wrong, the results say that there is a non-linear relationship between the predictor and the outcome. This relationship is statistically significant in the knots with p < 0.05.
If you could confirm this interpretation that would be great!
Best Answer
This is a restricted cabin spline, with the fit restricted to be linear beyond the outermost knots. You are getting the correct number of coefficients for the
rcs()
terms, as there are only 4 intervals within the limits of the 5 knots, and with the restriction to linearity beyond the outermost knots the fit is to smoothly-joined cubic splines just within those 4 intervals. The way thatrcs()
implements the splines, the coefficients are represented as a linear coefficient plus extra terms for the non-linearities. Frank Harrell explains the details in Section 2.4.5 of Regression Modeling Strategies.When there are multiple coefficients associated with a single predictor, the usual
summary()
of the model can be misleading as it only shows the "significance" of each of the coefficients individually. That's why I suggested in comments that you use theols()
andanova()
functions of therms
package. That assesses the overall association ofMatch_time
with outcome (including all its associated coefficients) in the first line of theanova()
output, and evaluates the joint significance of its 3 non-linear terms in the section line.* Yes, in this case you are correct that there is substantial evidence for a non-linearity (p = 4e-04) as you originally inferred from the individual coefficients in the first model summary, but the joint test is more reliable in general.As
Match_time
is the only predictor in the model, the report forREGRESSION
is the same as for that predictor overall. TheMS
in theERROR
line is the mean-squared residual error not "explained" by the predictor.*Note that
anova()
on the model produced bylm()
did not separate out the linear from the non-linear terms. It might do so if you used thens()
implementation of restricted cubic splines from thesplines
package instead ofrcs()
, but I find the default knot locations chosen byrcs()
to be more useful.