Binomial model with GAM (mgcv) using weights

generalized-additive-model, mgcv, r

I have been having an issue when trying to fit a binomial GAM to data. There are two ways these models can be coded: (i) providing a proportion as the response variable and the number of trials as weights; or (ii) providing a two-column matrix of successes and failures. I have reason to want to weight my data points (independently of the number of samples). However, I have noticed that if I use approach (ii) and add weights (via the weights argument), I get very odd results indeed. Furthermore, if I supply the same weights in relative terms but at different absolute magnitudes, I get very different output. This does not happen with an equivalent GLM (or, indeed, with the gam package). How can I provide a set of weights for the data points?
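For clarity, this is what I mean by the two specifications (a minimal sketch on made-up data; succ, fail, trials and z are just illustrative names):

library('mgcv')

# Made-up data: 20 groups of 10 Bernoulli trials each.
z = 1:20
trials = rep(10, 20)
succ = rbinom(20, trials, plogis(-2 + 0.15 * z))
fail = trials - succ
prop = succ / trials

# (i) Proportion as the response, number of trials as weights.
f1 = gam(prop ~ s(z), weights = trials, family = 'binomial')

# (ii) Two-column matrix of successes and failures as the response.
f2 = gam(cbind(succ, fail) ~ s(z), family = 'binomial')

# Both give the same fitted model.
all.equal(coef(f1), coef(f2))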

Here is an MRE:

library('mgcv')

# Random data.
x = 1:100
y_binom = cbind(rpois(100, 5 + x/2), rpois(100, 100))
w = sample(seq_len(100), 100, replace = TRUE)

# GAM models.
m1 = gam(y_binom ~ s(x), family = 'binomial')
m2 = gam(y_binom ~ s(x), weights = w / mean(w), family = 'binomial')
m3 = gam(y_binom ~ s(x), weights = w / sum(w), family = 'binomial')
m4 = gam(y_binom ~ s(x), weights = w * 100, family = 'binomial')

ms = list(m1, m2, m3, m4)

# Different RMSEs.
lapply(X = ms, FUN = function(x) return(sqrt(mean(x$residuals^2))))

# Different predictions, e.g.
plot(predict(m2), predict(m3))


# This does not happen with GLMs.
m1 = glm(y_binom ~ x, family = 'binomial')
m2 = glm(y_binom ~ x, weights = w / mean(w), family = 'binomial')
m3 = glm(y_binom ~ x, weights = w / sum(w), family = 'binomial')
m4 = glm(y_binom ~ x, weights = w * 100, family = 'binomial')

ms = list(m1, m2, m3, m4)

# Same RMSEs (for m2-m4).
lapply(X = ms, FUN = function(x) return(sqrt(mean(x$residuals^2))))

# Same predictions, e.g.
plot(predict(m2), predict(m3))

Best Answer

I think you are seeing a difference because of a situation in which smoothness selection has difficulty, not because of any inherent problem in the GLM part of the model; your choice of weights changes the magnitude of the log-likelihood, which results in slightly different models being returned.

I'll get back to that shortly. First, the "problem" goes away if you just fit a common or garden GLM with gam():

library('mgcv')

# Random data
set.seed(1)
x <- 1:100
y_binom <- cbind(rpois(100, 5 + x/2), rpois(100, 100))
w <- sample(seq_len(100), 100, replace = TRUE)

gam_m <- gam(y_binom ~ x, weights = w / mean(w), family = 'binomial')
glm_m <- glm(y_binom ~ x, weights = w / mean(w), family = 'binomial')

Exactly the same model is fitted:

> logLik(gam_m)
'log Lik.' -295.6122 (df=2)
> logLik(glm_m)
'log Lik.' -295.6122 (df=2)
> coef(gam_m)
(Intercept)           x 
 -2.1698127   0.0174864 
> coef(glm_m)
(Intercept)           x 
 -2.1698127   0.0174864

and even if you change the magnitude of the log-likelihood by using a different normalisation of the weights, you get the same fitted model even though the log-likelihood is different:

gam_other <- gam(y_binom ~ x, weights = w / sum(w), family = 'binomial')
> logLik(gam_other)
'log Lik.' -2.956122 (df=2)
> coef(gam_other)
(Intercept)           x 
 -2.1698127   0.0174864 

The behaviour of glm() is the same in this regard:

> logLik(glm(y_binom ~ x, weights = w / sum(w), family = 'binomial'))
'log Lik.' -2.956122 (df=2)

# compare with logLik(gam_other)
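As a quick sanity check, scaling all of the prior weights by a constant simply scales the binomial log-likelihood by that constant, so the two normalisations should differ by a factor of sum(w) / mean(w) = 100 here:

# Ratio of the two log-likelihoods equals the ratio of the two weight scalings,
# sum(w) / mean(w) = length(w) = 100.
as.numeric(logLik(glm_m)) /
    as.numeric(logLik(glm(y_binom ~ x, weights = w / sum(w), family = 'binomial')))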

This equivalence can break down in cases where the optimisation is more marginal, and that is what is happening with gam() here. Using my gratia package we can easily compare two of your GAMs (m2 and m3):

# using your GAM m2 and m3 as examples
library(gratia)
comp <- compare_smooths(m2, m3)
draw(comp)

which produces

[plot: comparison of the smooths of x estimated by m2 and m3, as produced by draw(comp)]

Note that, by default, the smooths in those plots include a correction related to the bias introduced when a smooth is estimated to be linear.

As you can see, the two fits are different, with one optimisation penalising the smooth all the way back to a linear function and the other not penalising quite as far. With more data, the extra complexity involved in fitting this model compared with a GLM (in the GAM we also have to select smoothness parameters) would be overcome, and I would expect the change to the log-likelihood not to have such a dramatic effect.
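You can also check this numerically rather than graphically: the estimated smoothing parameters and the effective degrees of freedom of s(x) tell the same story (a quick check on your m2 and m3; the exact numbers will depend on the simulated data, as your MRE doesn't set a seed):

# Smoothing parameter for s(x): a very large value means the smooth is being
# penalised all the way back towards a linear function.
c(m2 = unname(m2$sp), m3 = unname(m3$sp))

# Effective degrees of freedom of s(x): a value of (almost exactly) 1 means an
# effectively linear fit.
c(m2 = summary(m2)$edf, m3 = summary(m3)$edf)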

This situation is one where some of the theory about GAMs starts to get a little looser; there is work to try to correct or account for these issues, but it can often be difficult to tell the difference between something that is linear and something that is slightly non-linear on the scale of the link function. Here the true function is slightly non-linear on the scale of the link function, but m3 isn't able to identify this, in part, I think, because the weights are dominating the likelihood calculation.
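To get a feel for how mild that non-linearity is, you can sketch the (approximate) true relationship implied by your simulation: the expected counts are 5 + x/2 successes and 100 failures, so the expected proportion is roughly (5 + x/2) / (105 + x/2), which is only gently curved on the logit scale:

# Approximate true proportion implied by the Poisson means in the MRE.
p_true <- (5 + x/2) / ((5 + x/2) + 100)

# On the link (logit) scale the truth is close to, but not exactly, a straight line.
plot(x, qlogis(p_true), type = 'l',
     xlab = 'x', ylab = 'true proportion (logit scale)')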