Solved – Centering constraints in MGCV GAM

generalized-additive-modelr

The help pages for MGCV in R states the following:

Note that when using factor by variables, centering constraints are
applied to the smooths, which usually means that the by variable
should be included as a parametric term, as well.

What exactly does this mean? What are centering constraints?

To Gavin:

My understanding is that model bb creates a smooth separately for each value of fac. If the data for each level of fac is separately centered to have mean of zero, is this saying that new data with say fac=='1' with x2==mean(subset(dat,fac=="1")$x2 should have a predicted value of the intercept coefficient? And new data with fac=='2' with x2==mean(subset(dat,fac=="2")$x2 should have the same predicted value? It doesn't work that way so I must not entirely grasp the issue.

set.seed(1234)
dat <- gamSim(4)
bbb<-gam(y ~ s(x2, by=fac), data=dat) ## without fac and x0
summary(bbb) #(Intercept)    1.868 
ndat<-data.frame(fac="1",x2=mean(subset(dat,fac=="1")$x2))
predict(bbb,ndat) #2.393406
ndat2<-data.frame(fac="2",x2=mean(subset(dat,fac=="2")$x2))
predict(bbb,ndat2) #1.933263 
ndat3<-data.frame(fac="3",x2=mean(subset(dat,fac=="3")$x2))
predict(bbb,ndat3) #1.161115

Follow-up:
@Gavin,

If the following code is ran:

library(mgcv)
set.seed(1256)
dat <- gamSim(4)
dat<-dat[dat$fac==1 | dat$fac==2,] #simplify to only 2 levels of the factor

b <- gam(y ~ s(x1,by=fac) -1 , data=dat) #suppress intercept
summary(b)
newsource1<-dat[dat$fac==1,] #subset to fac=1
newsource2<-dat[dat$fac==2,] #subset to fac=2

new1<-data.frame(x1=newsource1$x1,fac=newsource1$fac) #data to predict (original data with fac=1)
new2<-data.frame(x1=newsource2$x1,fac=newsource2$fac) #data to predict (original data with fac=2)


##Plot both cases (fac=1 and fac=2) with vertical line at the respective x1 mean


p<-predict(b,new1) #predict data with fac=1
plot(new1$x1,p)
abline(h=0)
abline(v=mean(new1$x1)) #mean where fac=1

enter image description here

p<-predict(b,new2) #predict data with fac=2
plot(new2$x1,p)
abline(h=0)
abline(v=mean(new2$x1))  #mean where fac=2

enter image description here

While each smooth appears centered over the common mean (mean(dat$x1)), isn't the fact that a very different prediction of y results given (for example, but any x1 value seems to make the same argument)

x1=mean(dat$x1)
fac=1

versus

x1=mean(dat$x1)
fac=2

contrary to the idea that the factor must be added to the model (as a parametric component)
in order to recognize the difference between fac=1 and fac=2 (which I believed to be what Simon Wood was referring to).

Best Answer

What that means is that each smooth created using a by term will be centred at zero. In effect this means that the data for each level of by has the same mean (i.e. the intercept or constant term in the model). What Simon Wood means is that if the data have different mean values (Site A has higher values than Site B say and by = Site) then because of the centring constraints this difference in the means of the different groups will not be taken into account in the model.

Using an example from ?gam.model I will demonstrate the difference:

## Factor `by' variable example (with a spurious covariate x0)
## simulate data...

dat <- gamSim(4)

## fit model...
b <- gam(y ~ fac + s(x2, by=fac) + s(x0), data=dat)
bb <- gam(y ~ s(x2, by=fac) + s(x0), data=dat) ## without fac
summary(b)
summary(bb)

Note how in model b we include the fac variable as a separate parametric term in the model formula as well as in the s( ) part. The intercept in this model will then represent the mean for the first group (level) of fac and the coefficients for fac in the model (and in the summary output) will represent the difference in means from the first group of the other groups.

Look closely at the output from summary(), notice how the deviance explained by b is much higher than that of bb (without the parametric fac term), and consequently b has a lower GCV score (lower is better) and a lower scale parameter. The higher scale parameter in bb is due to there being a large amount of unexplained variance (the between group effect of fac) left over in the simpler model.

R> summary(b)
....
R-sq.(adj) =  0.644   Deviance explained = 65.6%
GCV score = 3.9013  Scale est. = 3.7517    n = 400

R> summary(bb)
....
R-sq.(adj) =  0.451   Deviance explained = 47.3%
GCV score = 6.0322  Scale est. = 5.7758    n = 400

Related Solutions

GAM – Predicting Mean Smooth in Generalized Additive Models with Interactions

The solution suggested by Simon Wood to the simpler problem of predicting the population level effect from a model with random intercepts represented as a smooth is to use a by variable in the random effect smooth. See this Answer for some detail.

You can't do this dummy trick directly with your model as you have the smooth and random effects all bound up in the 2d spline term. As I understand it, you should be able to decompose your tensor product spline into "main effects" and the "spline interaction". I quote these as the decomposition will be to split out the fixed effects and random effects parts of the model.

Nb: I think I have this right but it would be helpful to have people knowledgeable with mgcv give this a once over.

## load packages
library("mgcv")
library("ggplot2")
set.seed(0)
means <- rnorm(5, mean=0, sd=2)
group <- as.factor(rep(1:5, each=100))

## generate data
df <- data.frame(group = group,
                 x = rep(seq(-3,3, length.out =100), 5),
                 y = as.numeric(dnorm(x, mean=means[group]) > 
                       0.4*runif(10)),
                 dummy = 1) # dummy variable trick

This is what I came up with:

gam_model3 <- gam(y ~ s(x, bs = "ts") + s(group, bs = "re", by = dummy) + 
                  ti(x, group, bs = c("ts","re"), by = dummy),
                  data = df, family = binomial, method = "REML")

Here I've broken out the fixed effects smooth of x, the random intercepts and the random - smooth interaction. Each of the random effect terms includes by = dummy. This allows us to zero out these terms by switching dummy to be a vector of 0s. This works because by terms here multiply the smooth by a numeric value; where dummy == 1 we get the effect of the random effect smooth but when dummy == 0 we are multiplying the effect of each random effect smoother by 0.

To get the population level we need just the effect of s(x, bs = "ts") and zero out the other terms.

newdf <- data.frame(group = as.factor(rep(1, 100)), 
                    x = seq(-3, 3, length = 100),
                    dummy = rep(0, 100)) # zero out ranef terms
ilink <- family(gam_model3)$linkinv      # inverse link function
df2 <- predict(gam_model3, newdf, se.fit = TRUE)
ilink <- family(gam_model3)$linkinv
df2 <- with(df2, data.frame(newdf,
                            response = ilink(fit),
                            lwr = ilink(fit - 2*se.fit),
                            upr = ilink(fit + 2*se.fit)))

(Note that all this was done on the scale of the linear predictor and only backtransformed at the end using ilink())

Here's what the population-level effect looks like

theme_set(theme_bw())
p <- ggplot(df2, aes(x = x, y = response)) +
geom_point(data = df, aes(x = x, y = y, colour = group)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.1) +
geom_line()
p

And here are the group level smooths with the population level one superimposed

df3 <- predict(gam_model3, se.fit = TRUE)
df3 <- with(df3, data.frame(df,
                            response = ilink(fit),
                            lwr = ilink(fit - 2*se.fit),
                            upr = ilink(fit + 2*se.fit)))

and a plot

p2 <- ggplot(df3, aes(x = x, y = response)) +
geom_point(data = df, aes(x = x, y = y, colour = group)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = group), alpha = 0.1) +
geom_line(aes(colour = group)) +
geom_ribbon(data = df2, aes(ymin = lwr, ymax = upr), alpha = 0.1) +
geom_line(data = df2, aes(y = response))
p2

From a cursory inspection this looks qualitatively similar to the result from Ben's answer but it is smoother; you don't get the blips where the next group's data is not all zero.

Solved – GAM (mgcv): AIC vs Deviance Explained

The respective formulas for these two quantities are: $$\text{deviance} = 2\log\mathcal{L}(\text{saturated model}\, |\, \text{data}) - 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$ $$\text{AIC} = 2k- 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$ where $\mathcal{L}$ is the likelihood and $k$ is the number of model parameters. For a fixed dataset and model family, the saturated model is fixed, and therefore for our purposes the equation for deviance is: $$\text{deviance} = \text{constant} - 2\log\mathcal{L}(\text{model}\, |\, \text{data})$$

Plotting AIC against deviance the way that you've done, we expect the data to fall along a straight line if there exist constants $c_1$ and $c_2$ such that: $$c_1 \cdot \text{AIC} + c_2 \approx \text{Deviance}$$

This can only be the case if $k \propto \log\mathcal{L}$. Although this is not a relationship that I have previously come across, it seems plausible.

However it could also be that a different formula for Deviance is being used altogether, as intimated here.

Best Answer

Related Solutions

GAM – Predicting Mean Smooth in Generalized Additive Models with Interactions

Solved – GAM (mgcv): AIC vs Deviance Explained

Related Question