Data Visualization – Understanding mgcViz::pterm() in GAM Parametric Coefficients

Tags: data-visualization, generalized-additive-model, ggplot2, mgcv, parametric

I built a GAM with two categorical variables and two smooth terms, following this structure:

model <- ik ~ population_id_cat + s_status + s(n_locs, bs = "re") + s(animal_id, bs = "re")

The model summary for the parametric coefficients is:

Parametric coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      1.90509    0.07798  24.432  < 2e-16 ***
population_id_catB              -1.08790    0.13199  -8.242 2.27e-16 ***
population_id_catC               0.06718    0.08880   0.757 0.449368    
population_id_catD              -0.27599    0.07689  -3.589 0.000336 ***
population_id_catE              -0.33914    0.13555  -2.502 0.012391 *  
population_id_catF              -0.71104    0.07586  -9.373  < 2e-16 ***
population_id_catG              -0.39243    0.07276  -5.393 7.32e-08 ***
population_id_catH               0.31115    0.18431   1.688 0.091449 .  
population_id_catI               0.06530    0.10203   0.640 0.522174    
s_status_2a_m                   -0.16331    0.05263  -3.103 0.001930 ** 
s_status_2fam                   -0.36656    0.05200  -7.050 2.10e-12 ***

When I plot these coefficients with the mgcViz::pterm() function, the intervals do not seem to match the Std. Error (although their significance does match the p-values provided).

For example, the plot for the parametric coefficients mentioned above:

gviz_model <- getViz(model)
gviz_model_1 <- pterm(gviz_model, 1)
plot(gviz_model_1) + l_ciBar(colour = "blue") + l_fitPoints(colour = "red") + l_rug(alpha = 0.3)

[plot: pterm() output showing the parametric coefficients with blue CI bars and red fitted points]

Question 1:

What values are actually plotted here? They seem to match the "Estimate" column from the summary, but the error bars do not match the "Std. Error" (e.g., look at population H: with an interval of ±1 Std. Error, the whole bar would lie above 0 and the effect would appear significant, yet its p-value is 0.09).

If I instead plot the Estimate ± Std. Error in ggplot (which is my goal, because I want to plot these parametric values for two models in the same plot), I get this:

gviz_model_pop <- termplot(gviz_model, se = TRUE, plot = FALSE)$population_id_cat
ggplot(gviz_model_pop, aes(x = x, y = y)) +
  geom_errorbar(aes(ymin = y - se, ymax = y + se), width = 0.15,
                linewidth = 1, position = position_dodge(0.2)) +
  geom_point(position = position_dodge(0.2), size = 3.5, shape = 16)

[plot: ggplot of Estimate ± 1 Std. Error per category, showing narrower intervals than the pterm() plot]

Similar trends, but the intervals are definitely not the same (and this changes the apparent significance of each category).

Question 2:

If I want to plot these parametric coefficients for two models in the same plot, how can I do this using the output from pterm(), instead of extracting the estimates/Std. Errors as I did previously? Preferably with a ggplot output.

Question 3:

If I want to plot these parametric coefficients including the intercept (to make the plot more interpretable), as mentioned here under "Transformed standard errors (2)", how do I do it? Do I need separate calculations, or is there a function for the parametric coefficients too? Preferably with a ggplot output.

Thank you very much in advance for any help!

Best Answer

Q1

The blue bars are confidence intervals, (most likely) formed as Estimate ± 2 × Std. Error. That is why they are wider than the ±1 Std. Error bars you drew with ggplot.
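You can check this against the summary table directly; a quick sketch, assuming `model` is the fitted GAM from the question:

```r
# Rebuild the intervals that (likely) match the blue bars:
# Estimate +/- 2 * Std. Error from the parametric coefficient table
pt <- summary(model)$p.table
ci <- data.frame(
  term     = rownames(pt),
  estimate = pt[, "Estimate"],
  lower    = pt[, "Estimate"] - 2 * pt[, "Std. Error"],
  upper    = pt[, "Estimate"] + 2 * pt[, "Std. Error"]
)
ci  # compare these limits with the bars drawn by l_ciBar()
```

For population H this gives roughly 0.311 ± 0.369, an interval that crosses zero, consistent with its non-significant p-value and with the pterm() plot.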

Q2

I think this would be easier with parametric_effects() from my {gratia} package; pterm() seems to wrap the model object with some extra information, and that information doesn't appear to be what is computed for the plot.

Using the example from ?pterm, we have

library("mgcv")   # for gamSim() and gam()
library("gratia")

set.seed(3)
dat <- gamSim(1, n = 1500, dist = "normal", scale = 20)
dat$fac <- as.factor(sample(c("A1", "A2", "A3"), nrow(dat), replace = TRUE))
dat$logi <- as.logical(sample(c(TRUE, FALSE), nrow(dat), replace = TRUE))
bs <- "cr"; k <- 12
b <- gam(y ~ x0 + x1 + I(x1^2) + s(x2, bs = bs, k = k) + fac + x3:fac +
           I(x1 * x2) + logi,
         data = dat)

p <- parametric_effects(b)

Which gives

> p <- parametric_effects(b)                                                  
Interaction terms are not currently supported.
> p                                                                           
# A tibble: 6,005 × 6
   term  type       value level partial    se
   <chr> <chr>   <I<dbl>> <fct>   <dbl> <dbl>
 1 x0    numeric    0.168 NA     -0.310 0.309
 2 x0    numeric    0.808 NA     -1.49  1.48 
 3 x0    numeric    0.385 NA     -0.711 0.708
 4 x0    numeric    0.328 NA     -0.605 0.602
 5 x0    numeric    0.602 NA     -1.11  1.11 
 6 x0    numeric    0.604 NA     -1.12  1.11 
 7 x0    numeric    0.125 NA     -0.230 0.229
 8 x0    numeric    0.295 NA     -0.544 0.541
 9 x0    numeric    0.578 NA     -1.07  1.06 
10 x0    numeric    0.631 NA     -1.17  1.16 
# ℹ 5,995 more rows
# ℹ Use `print(n = ...)` to see more rows

The confidence interval is easy to add:

p <- p |>
  dplyr::mutate(lower_ci = partial - (1.96 * se), upper_ci = partial + (1.96 * se))

I would also append a model column to indicate which model this is:

p <- p |>
  dplyr::mutate(model = "Model 1")

If you repeat that for the other model, to create p2 say, you should be able to bind the two objects together with

p_both <- dplyr::bind_rows(p, p2)

You can also select which terms you want to extract the info for using the terms argument to parametric_effects().
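A sketch of the plotting step, assuming `p_both` was built as above with the `lower_ci`, `upper_ci`, and `model` columns added; for the example model the factor term is `fac` (for your models you would filter on `population_id_cat` instead):

```r
library("ggplot2")

# Factor terms appear once per observation in parametric_effects() output,
# so collapse to the distinct levels before plotting
p_both |>
  dplyr::filter(term == "fac") |>
  dplyr::distinct(model, term, level, partial, lower_ci, upper_ci) |>
  ggplot(aes(x = level, y = partial, colour = model)) +
  geom_pointrange(aes(ymin = lower_ci, ymax = upper_ci),
                  position = position_dodge(width = 0.3))
```

position_dodge() keeps the two models' estimates from overplotting at each factor level.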

Q3

The best way to do this is to predict from the model, including only the constant term and the parametric term of your choice. In the example from ?pterm that I used above, there is a factor variable fac, which we'll use to illustrate the point.

# get a data slice where only `fac` varies,
# holding all other variables at representative values
ds <- data_slice(b, fac = evenly(fac)) |>
  dplyr::mutate(logi = as.logical(logi)) # <-- work around bug

# use fitted_values() to predict from the model using only the effects
# of `fac` and the constant term
fv <- fitted_values(b, data = ds, terms = c("(Intercept)", "fac"))
fv

There's currently a bug in data_slice(): it seems to convert logical variables to numeric. We work around that by coercing them back to logical. We only need this to keep predict.gam() happy; no variable other than fac is used in the predictions we compute in fv.

Having done the above fv is:

> fv                                                                          
# A tibble: 3 × 10
  fac      x0    x1    x2 logi     x3 fitted    se  lower upper
  <fct> <dbl> <dbl> <dbl> <lgl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
1 A1    0.511 0.521 0.493 FALSE 0.476   5.75  2.55  0.753 10.8 
2 A2    0.511 0.521 0.493 FALSE 0.476   4.60  2.54 -0.367  9.57
3 A3    0.511 0.521 0.493 FALSE 0.476   2.94  2.63 -2.21   8.09

Here, as fac has only three levels, we get the three values needed for the plot. The fitted column now has the intercept added to each of the parametric effects, hence A1, being the reference level, has the same value as the intercept:

> coef(b)[1]                                                                  
(Intercept) 
   5.753834

while the other values in fitted are (Intercept) plus the respective coefficient for the other levels. The lower and upper columns contain the confidence interval as it would be plotted in the output from mgcViz. Note that fitted_values() returns, by default, predicted values on the response scale. If you have a non-Gaussian family, you can keep the values on the link scale by passing scale = "link" in the call to fitted_values().
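These fitted values and intervals drop straight into ggplot; a minimal sketch using the fv object from above:

```r
library("ggplot2")

# Intercept-included parametric effects with their confidence intervals
ggplot(fv, aes(x = fac, y = fitted)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  labs(x = "fac", y = "Effect of fac (including intercept)")
```

To show two models together, add a model column to each fitted_values() result, dplyr::bind_rows() them, and map colour = model as in Q2.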