GAM – Visualizing Generalized Additive Model Predictions with Baseline Control Values

data visualization, generalized-additive-model, mgcv, r

I am fitting GAMs to check whether forest treatment (two types of logging regime) influences bird abundance across years. Abundance was counted on fixed plots, and each plot lies within a single treatment area. Bird abundance was surveyed in each plot 6 times between 2005 and 2020.

I created a simplified, reproducible example that mirrors my dataset, and fitted a GAM to it:

library(mgcv)
library(ggplot2)
set.seed(1)
plot <- rep(sprintf("p%s",seq(1:18)), each=6)
treatment <- rep(c("control", "treatment1", "treatment2"),each=36)
year <- rep(c(2005,2007,2008,2010,2012,2020), 18)
abundance <- c(sort(runif(36, min = 1, max = 40), decreasing = TRUE), sort(runif(36, min = 1, max = 35), decreasing = TRUE), sort(runif(36, min = 16, max = 32)))
piska_df <- as.data.frame(cbind(plot,treatment,year,abundance))
piska_df$plot <- as.factor(piska_df$plot)
piska_df$treatment <- as.factor(piska_df$treatment)
piska_df$abundance <- as.integer(piska_df$abundance)
piska_df$year <- as.integer(piska_df$year)

g1<-gam(abundance ~  treatment*year + s(plot,bs="re"), data=piska_df, family=poisson, method="REML")

The model works great! But I am now stuck on visualizing the main results in a clear manner.

I decided to calculate predicted values for each treatment separately across years, while holding the random factor ("plot") constant. Afterwards I back-transformed the predictions with the model's inverse link function (exp() for the Poisson log link, not inv.logit) to get abundance values on the original scale. I calculated approximate 95% CIs as fit ± 2 SE on the link scale before back-transforming. This is the code that I used:

year.pr <-seq(min(piska_df$year),max(piska_df$year), length.out = 100)
new_data_ctrl=list(plot=rep("p1",100),
                   treatment=rep("control",100),
                   year=year.pr)
new_data_t1=list(plot=rep("p1",100),
                 treatment=rep("treatment1",100),
                  year=year.pr)
new_data_t2=list(plot=rep("p1",100),
                  treatment=rep("treatment2",100),
                  year=year.pr)
new_data_t2 <- as.data.frame(new_data_t2)
new_data_t1 <- as.data.frame(new_data_t1)
new_data_ctrl <- as.data.frame(new_data_ctrl)

ilink <- family(g1)$linkinv
g.pred.ctrl <- predict(g1,newdata=new_data_ctrl,
                           type="link",se.fit = TRUE)
g.pred.t1 <-predict(g1,newdata=new_data_t1,
                          type="link",se.fit = TRUE)
g.pred.t2 <-predict(g1,newdata=new_data_t2,
                          type="link",se.fit = TRUE)

g.pred.ctrl <- cbind(g.pred.ctrl, new_data_ctrl)
g.pred.ctrl <- transform(g.pred.ctrl, lwr_ci = ilink(fit - (2 * se.fit)),
                              upr_ci = ilink(fit + (2 * se.fit)),
                              fitted = ilink(fit))

g.pred.t1 <- cbind(g.pred.t1, new_data_t1)
g.pred.t1 <- transform(g.pred.t1, lwr_ci = ilink(fit - (2 * se.fit)),
                             upr_ci = ilink(fit + (2 * se.fit)),
                             fitted = ilink(fit))

g.pred.t2 <- cbind(g.pred.t2, new_data_t2)
g.pred.t2 <- transform(g.pred.t2, lwr_ci = ilink(fit - (2 * se.fit)),
                             upr_ci = ilink(fit + (2 * se.fit)),
                             fitted = ilink(fit))

g.pred.all <- rbind(g.pred.t2,g.pred.t1,g.pred.ctrl)
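As a side note, the three prediction blocks above can be collapsed into one. The following is only a sketch under stated assumptions: the data are freshly simulated (the names d, m, and nd are mine, not from the question), and the random effect is dropped from the predictions with the exclude argument of predict.gam() rather than fixed at plot "p1". Whether to exclude or fix the random effect is a modelling choice, not something the question settles.

```r
library(mgcv)

# Simulated stand-in with the same layout as the question's data:
# 18 plots x 6 survey years, 6 plots per treatment (values are made up)
set.seed(1)
d <- expand.grid(year = c(2005, 2007, 2008, 2010, 2012, 2020),
                 plot = factor(sprintf("p%s", 1:18)))
d$treatment <- factor(rep(c("control", "treatment1", "treatment2"), each = 36))
d$abundance <- rpois(nrow(d), lambda = 20)

m <- gam(abundance ~ treatment * year + s(plot, bs = "re"),
         data = d, family = poisson, method = "REML")

# One prediction grid covering all treatments. A plot column must be present,
# but its value does not matter once s(plot) is excluded below.
nd <- expand.grid(year = seq(min(d$year), max(d$year), length.out = 100),
                  treatment = levels(d$treatment),
                  plot = levels(d$plot)[1])

# Predict on the link scale with the random effect set to zero
p <- predict(m, newdata = nd, type = "link", se.fit = TRUE,
             exclude = "s(plot)")

# Back-transform the fit and the approximate 95% interval
ilink <- family(m)$linkinv
nd$fitted <- ilink(p$fit)
nd$lwr_ci <- ilink(p$fit - 2 * p$se.fit)
nd$upr_ci <- ilink(p$fit + 2 * p$se.fit)
```

The resulting nd has the same columns the plotting code expects (year, treatment, fitted, lwr_ci, upr_ci), so it could stand in for g.pred.all.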

Then I plotted a ggplot graph, using the predicted values:

ggplot(g.pred.all, aes(x = year, y = fitted, colour = factor(treatment))) +
  theme_classic() +
  geom_ribbon(aes(ymin = lwr_ci, ymax = upr_ci, fill = factor(treatment)), alpha = 0.1) +
  geom_line(linewidth=1.5) +
  ggtitle("abundance") + xlab("year")

This is the graph:

And here comes my problem. I am interested in how each treatment differs from the control, and I want that to be the main focus of these graphs. I am not interested in the general decrease/increase; I am interested in the decrease/increase relative to the control.

Therefore I thought it would be a nice idea to show the control as a horizontal line at 0 (with its confidence interval). My Y axis would then become "abundance difference from control" instead of "abundance".

My question is: how do I transform my predictions so that the control is a straight line through 0 while maintaining the "true" mathematical relations between points and confidence intervals? Can I just calculate the difference between all other values and the mean control and plot this on the graph? Does that make sense mathematically? Should the CI values be recalculated somehow?

All help would be very valuable. I am also open to any other simple and convenient ways to visualize those results (simple visualization of 3 treatments and their trends over years).

Thank you very much in advance.

Best Answer

Comparing multiple treatments with a control can be done using Dunnett's test. The emmeans package makes this convenient. Some minor tweaks are required because, by default, emmeans will compare the groups using ratios instead of differences (i.e. differences on the log scale). Furthermore, emmeans will adjust $p$-values and confidence intervals using Dunnett's test within each year.

Note: You currently estimate the model using restricted maximum likelihood (REML). This is fine if you want unbiased tests for the random effects but suboptimal for comparison of fixed effects. I suggest refitting the model using maximum likelihood (ML) before doing these comparisons. The figure below is for your model fitted with maximum likelihood.
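A short sketch of why the scale matters, in notation of my own choosing ($\mu_T(t)$ and $\mu_C(t)$ for the expected abundance under a treatment and under control in year $t$; these symbols do not appear in the question). With the Poisson log link, a contrast on the link scale is a log ratio:

$$\log \mu_T(t) - \log \mu_C(t) = \log \frac{\mu_T(t)}{\mu_C(t)}$$

so back-transforming it yields a ratio, not a difference. Regridding to the response scale instead makes the contrast

$$\Delta(t) = \mu_T(t) - \mu_C(t)$$

which puts the control at 0 by construction. Its uncertainty involves both groups:

$$\operatorname{Var}\big(\hat\Delta(t)\big) = \operatorname{Var}\big(\hat\mu_T(t)\big) + \operatorname{Var}\big(\hat\mu_C(t)\big) - 2\operatorname{Cov}\big(\hat\mu_T(t), \hat\mu_C(t)\big)$$

This is why subtracting the control curve from the treatment curves and reusing the per-treatment bands would not give valid intervals: the interval has to be recomputed for the difference itself, which is what the contrast machinery does.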

Here is the code (I assume your code has been run before):

library(emmeans)
library(ggplot2)

# Setting up the reference grid using 20 values for year
refgrid <- ref_grid(g1, at = list(year = seq(2005, 2020, length = 20)), regrid = "response")

# Marginal means
em <- emmeans(refgrid, "treatment", by = "year")

# Dunnett's test for each year ("trt.vs.ctrl")
contrs <- contrast(em, "trt.vs.ctrl", infer = c(TRUE, FALSE))

# Convert to data frame for plotting
plot_dat <- as.data.frame(contrs)
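The plotting step could then look roughly like the sketch below. This is my own illustration, not necessarily the code behind the original figure. The helper name plot_vs_control is hypothetical, and the toy data frame contains made-up numbers whose only purpose is to exercise the function. I assume the contrast data frame has the columns contrast, year, and estimate, plus interval columns named either lower.CL/upper.CL or asymp.LCL/asymp.UCL depending on the degrees of freedom emmeans uses.

```r
library(ggplot2)

# Hypothetical helper: plot treatment-minus-control differences over years,
# with the control represented by a dashed horizontal line at 0
plot_vs_control <- function(plot_dat) {
  # Pick whichever pair of interval columns is present
  lo <- intersect(c("lower.CL", "asymp.LCL"), names(plot_dat))[1]
  hi <- intersect(c("upper.CL", "asymp.UCL"), names(plot_dat))[1]
  stopifnot(!is.na(lo), !is.na(hi))
  plot_dat$lwr <- plot_dat[[lo]]
  plot_dat$upr <- plot_dat[[hi]]

  ggplot(plot_dat, aes(x = year, y = estimate,
                       colour = contrast, fill = contrast)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, colour = NA) +
    geom_line(linewidth = 1) +
    theme_classic() +
    labs(x = "year", y = "abundance difference from control")
}

# Made-up numbers in the assumed layout, only to show the call
toy <- data.frame(
  contrast  = rep(c("treatment1 - control", "treatment2 - control"), each = 3),
  year      = rep(c(2005, 2012, 2020), times = 2),
  estimate  = c(-2, -1, 0, 3, 5, 8),
  asymp.LCL = c(-5, -4, -3, 0, 2, 4),
  asymp.UCL = c(1, 2, 3, 6, 8, 12)
)
p <- plot_vs_control(toy)
```

With the real results, the call would be plot_vs_control(plot_dat). Wherever a ribbon excludes the dashed zero line, the treatment differs from the control in that year at the Dunnett-adjusted level.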

Related Solutions

Generalized Additive Model – Using Weights in Binomial Model with GAM (mgcv)

I think you are seeing a difference because of an issue where smooths have difficulty and not any inherent problem in the GLM part of the model; your choice of weights is changing the magnitude of the log-likelihood which is resulting in slightly different models being returned.

I'll get back to that shortly. First, the "problem" goes away if you just fit a common or garden GLM with gam():

library('mgcv')

# Random data
set.seed(1)
x <- 1:100
y_binom <- cbind(rpois(100, 5 + x/2), rpois(100, 100))
w <- sample(seq_len(100), 100, replace = TRUE)

gam_m <- gam(y_binom ~ x, weights = w / mean(w), family = 'binomial')
glm_m <- glm(y_binom ~ x, weights = w / mean(w), family = 'binomial')

Exactly the same model is fitted:

> logLik(gam_m)
'log Lik.' -295.6122 (df=2)
> logLik(glm_m)
'log Lik.' -295.6122 (df=2)
> coef(gam_m)
(Intercept)           x 
 -2.1698127   0.0174864 
> coef(glm_m)
(Intercept)           x 
 -2.1698127   0.0174864

and even if you change the magnitude of the log-likelihood by using a different normalization of the weights, you get the same fitted model even though the log-likelihood is different:

gam_other <- gam(y_binom ~ x, weights = w / sum(w), family = 'binomial')

> logLik(gam_other)
'log Lik.' -2.956122 (df=2)
> coef(gam_other)
(Intercept)           x 
 -2.1698127   0.0174864

The behaviour of glm() is the same in this regard:

> logLik(glm(y_binom ~ x, weights = w / sum(w), family = 'binomial'))
'log Lik.' -2.956122 (df=2)

# compare with logLik(gam_other)

This might break down in cases where the optimisation is more marginal, and this is what's happening with gam(). Using my gratia package we can easily compare the two GAMs from your question:

# using your GAM m2 and m3 as examples
library(gratia)
comp <- compare_smooths(m2, m3)
draw(comp)

which produces

Note that, by default, the smooths in those plots include a correction related to bias introduced when the smooth is estimated to be linear.

As you can see, the two fits are different; with one optimization penalising the smooth all the way back to a linear function and the other not quite penalizing as far. With more data, the extra complexity involved in fitting this model over a GLM (where in the GAM we're having to select smoothness parameters), would be overcome and I would expect the change to the log-likelihood to not have such a dramatic effect.

This situation is one where some of the theory about GAMs starts to get a little looser. There's work to try to correct or account for these issues, but often it can be difficult to tell the difference between something that is linear or slightly non-linear on the scale of the link function. Here the true function is slightly non-linear on the scale of the link function but m3 isn't able to identify this, in part, I think, because the weights are dominating the likelihood calculation.
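One way to see why rescaling the weights is harmless for the GLM but not for the GAM, sketched in notation of my own (it is not taken from the answer above, and it assumes the usual penalized-likelihood form with a fixed scale parameter). For a GLM the weighted log-likelihood is

$$\ell_w(\beta) = \sum_i w_i\, \ell_i(\beta)$$

and multiplying every weight by a constant $c > 0$ leaves the maximizer unchanged:

$$\arg\max_\beta\; c\,\ell_w(\beta) = \arg\max_\beta\; \ell_w(\beta)$$

For a GAM, the coefficients instead maximize a penalized version,

$$\ell_p(\beta) = c\,\ell_w(\beta) - \frac{1}{2}\sum_j \lambda_j\, \beta^\top S_j\, \beta$$

so for a given $\lambda_j$ the constant $c$ changes the balance between fit and penalty. As far as I can tell, the criterion used to select the $\lambda_j$ is not invariant to $c$ either, which would explain why two normalizations of the same weights can lead to different amounts of smoothing.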

Data Visualization – Interpreting y-axis Values in Plot(gam)

The model is a generalization of the generalized linear model (it's not a true GLM, as we have the extra parameter that defines the extra dispersion that the NB has over the Poisson) and the parameters of the model are estimated on the scale of a link function, in this case the log scale:

$$y_i \sim \mathcal{NB}(\mu_i, \boldsymbol{\theta})$$

where

$$g(\mu_i) = \beta_1 + f_1(\mathtt{Distance}_i)$$

where $g()$ is the link function, which in the case of the NB is typically $\log()$. So we have

$$\log(\mu_i) = \beta_1 + f_1(\mathtt{Distance}_i)$$

and

$$\mu_i = \exp(\beta_1 + f_1(\mathtt{Distance}_i))$$

where we've taken the inverse of the log function to get the expected value of the response $\mu_i$.

When you just do plot(), you get the partial effect of $f_1$, and this is centred about 0 due to the sum-to-zero constraint applied to all smooths. When you used shift, you added on $\beta_1$ which gives us the right hand side of

$$\log(\mu_i) = \beta_1 + f_1(\mathtt{Distance}_i)$$

What you're missing is the bit on the left hand side; these values are on the log scale, where negative values are allowed.

The solution then is to apply the inverse of this link function to the values. This is done via the trans argument to plot.gam().

Hence, for such a simple GAM, you can get what you want via:

plot(model, residuals = TRUE, pch=1, cex=1, seWithMean = TRUE,
     shift = coef(model)[1],
     trans = exp)

where exp is the exponential function, the inverse of the log function. In this case, this will then yield the actual predicted values from the model for a range of values over Distance on the response scale.
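To make the role of shift and trans concrete, here is a small sketch on simulated data (the data, the seed, and the names f1, mu_manual, and mu_pred are my own assumptions, not from the question). It adds the intercept to the centred partial effect and applies exp(), then compares the result with the response-scale predictions. This equivalence only holds for a model with an intercept and a single smooth.

```r
library(mgcv)

# Simulated negative binomial counts with a smooth effect of Distance
set.seed(42)
Distance <- runif(200, 0, 10)
y <- rnbinom(200, mu = exp(1 + sin(Distance / 2)), size = 2)
model <- gam(y ~ s(Distance), family = nb(), method = "REML")

# Partial effect of the smooth on the link (log) scale, centred about 0;
# this is what plot(model) draws by default
f1 <- predict(model, type = "terms")[, "s(Distance)"]

# Add the intercept (the job of shift), then invert the log link
# (the job of trans = exp)
mu_manual <- exp(coef(model)[1] + f1)

# Response-scale predictions straight from the model
mu_pred <- predict(model, type = "response")

# Compare the two sets of values
all.equal(as.numeric(mu_manual), as.numeric(mu_pred))
```

For models with more terms, the shifted and transformed partial effect no longer equals the full prediction, because the other terms also contribute to the linear predictor.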
